{"title": "Convolutional Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 2849, "page_last": 2858, "abstract": "We present a practical way of introducing convolutional structure into Gaussian processes, making them more suited to high-dimensional inputs like images. The main contribution of our work is the construction of an inter-domain inducing point approximation that is well-tailored to the convolutional kernel. This allows us to gain the generalisation benefit of a convolutional kernel, together with fast but accurate posterior inference. We investigate several variations of the convolutional kernel, and apply it to MNIST and CIFAR-10, where we obtain significant improvements over existing Gaussian process models. We also show how the marginal likelihood can be used to find an optimal weighting between convolutional and RBF kernels to further improve performance. This illustration of the usefulness of the marginal likelihood may help automate discovering architectures in larger models.", "full_text": "Convolutional Gaussian Processes\n\nMark van der Wilk\n\nDepartment of Engineering\nUniversity of Cambridge, UK\n\nmv310@cam.ac.uk\n\nCarl Edward Rasmussen\nDepartment of Engineering\nUniversity of Cambridge, UK\n\ncer54@cam.ac.uk\n\nAbstract\n\nJames Hensman\n\nprowler.io\n\nCambridge, UK\n\njames@prowler.io\n\nWe present a practical way of introducing convolutional structure into Gaussian\nprocesses, making them more suited to high-dimensional inputs like images. The\nmain contribution of our work is the construction of an inter-domain inducing point\napproximation that is well-tailored to the convolutional kernel. This allows us to\ngain the generalisation bene\ufb01t of a convolutional kernel, together with fast but\naccurate posterior inference. 
We investigate several variations of the convolutional kernel, and apply it to MNIST and CIFAR-10, where we obtain significant improvements over existing Gaussian process models. We also show how the marginal likelihood can be used to find an optimal weighting between convolutional and RBF kernels to further improve performance. This illustration of the usefulness of the marginal likelihood may help automate discovering architectures in larger models.\n\n1 Introduction\n\nGaussian processes (GPs) [1] can be used as a flexible prior over functions, which makes them an elegant building block in Bayesian nonparametric models. In recent work, there has been much progress in addressing the computational issues preventing GPs from scaling to large problems [2, 3, 4, 5]. However, orthogonal to being able to algorithmically handle large quantities of data is the question of how to build GP models that generalise well. The properties of a GP prior, and hence its ability to generalise in a specific problem, are fully encoded by its covariance function (or kernel). Most common kernel functions rely on rather rudimentary and local metrics for generalisation, like the Euclidean distance. This has been widely criticised, notably by Bengio [6], who argued that deep architectures allow for more non-local generalisation. While deep architectures have seen enormous success in recent years, it is an interesting research question to investigate what kind of non-local generalisation structures can be encoded in shallow structures like kernels, while preserving the elegant properties of GPs.\n\nConvolutional structures have non-local influence and have successfully been applied in neural networks to improve generalisation for image data [see e.g. 7, 8]. In this work, we investigate how Gaussian processes can be equipped with convolutional structures, together with accurate approximations that make them applicable in practice. 
A previous approach by Wilson et al. [9] transforms the inputs to a kernel using a convolutional neural network. This produces a valid kernel since applying a deterministic transformation to kernel inputs results in a valid kernel [see e.g. 1, 10], with the (many) parameters of the transformation becoming kernel hyperparameters. We stress that our approach is different in that the process itself is convolved, which does not require the introduction of additional parameters. Although our method does have inducing points that play a similar role to the filters in a convolutional neural network (convnet), these are variational parameters and are therefore more protected from over-fitting.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n2 Background\n\nInterest in Gaussian processes in the machine learning community started with the realisation that a shallow but infinitely wide neural network with Gaussian weights was a Gaussian process [11] \u2013 a nonparametric model with analytically tractable posteriors and marginal likelihoods. This gives two main desirable properties. Firstly, the posterior gives uncertainty estimates, which, combined with having an infinite number of basis functions, results in sensibly large uncertainties far from the data (see Qui\u00f1onero-Candela and Rasmussen [12, fig. 5] for a useful illustration). Secondly, the marginal likelihood can be used to select kernel hyperparameters. The main drawback is an O(N^3) computational cost for N observations. Because of this, much attention over recent years has been devoted to scaling GP inference to large datasets through sparse approximations [2, 13, 14], minibatch-based optimisation [3], exploiting structure in the covariance matrix [e.g. 15] and Fourier methods [16, 17].\n\nIn this work, we adopt the variational framework for approximation in GP models, because it can simultaneously give a computational speed-up to O(NM^2) (with M \u226a N) through sparse approximations [2] and approximate posteriors due to non-Gaussian likelihoods [18]. The variational choice is both elegant and practical: it can be shown that the variational objective minimises the KL divergence across the entire latent process [4, 19], which guarantees that the exact model will be approximated given enough resources. Other methods, such as EP/FITC [14, 20, 21, 22], can be seen as approximate models that do not share this property, leading to behaviour that would not be expected from the model that is to be approximated [23]. It is worth noting however, that our method for convolutional GPs is not specific to the variational framework, and can be used without modification with other objective functions, such as variations on EP.\n\n2.1 Gaussian variational approximation\n\nWe adopt the popular choice of combining a sparse GP approximation with a Gaussian assumption, using a variational objective as introduced in [24]. We choose our model to be\n\nf(\u00b7) | \u03b8 \u223c GP(0, k(\u00b7, \u00b7)) ,   (1)\nyi | f, xi iid\u223c p(yi | f(xi)) ,   (2)\n\nwhere p(yi | f(xi)) is some non-Gaussian likelihood, for example a Bernoulli distribution through a probit link function for classification. The kernel parameters \u03b8 are to be estimated by approximate maximum likelihood, and we drop them from the notation hereon. Following Titsias [2], we choose the approximate posterior to be a GP with its marginal distribution specified at M \u201cinducing inputs\u201d Z = {z_m}_{m=1}^M. 
Denoting the value of the GP at those points as u = {f(z_m)}_{m=1}^M, the approximate posterior process is constructed from the specified marginal and the prior conditional1:\n\nu \u223c N(m, S) ,   (3)\nf(\u00b7) | u \u223c GP( ku(\u00b7)^T Kuu^{-1} u , k(\u00b7, \u00b7) \u2212 ku(\u00b7)^T Kuu^{-1} ku(\u00b7) ) .   (4)\n\nThe vector-valued function ku(\u00b7) gives the covariance between u and the remainder of f, and is constructed from the kernel: ku(\u00b7) = [k(z_m, \u00b7)]_{m=1}^M. The matrix Kuu is the prior covariance of u. The variational parameters m, S and Z are then optimised with respect to the evidence lower bound (ELBO):\n\nELBO = \u2211_i E_{q(f(xi))}[log p(yi | f(xi))] \u2212 KL[q(u) || p(u)] .   (5)\n\nHere, q(u) is the density of u associated with equation (3), and p(u) is the prior density from (1). Expectations are taken with respect to the marginals of the posterior approximation, given by\n\nq(f(xi)) = N(\u00b5i, \u03c3i^2) ,   (6)\n\u00b5i = ku(xi)^T Kuu^{-1} m ,   (7)\n\u03c3i^2 = k(xi, xi) + ku(xi)^T Kuu^{-1} (S \u2212 Kuu) Kuu^{-1} ku(xi) .   (8)\n\n1The construction of the approximate posterior can alternatively be seen as a GP posterior to a regression problem, where the q(u) indirectly specifies the likelihood. Variational inference will then adjust the inputs and likelihood of this regression problem to make the approximation close to the true posterior in KL divergence.\n\nThe matrices Kuu and Kfu are obtained by evaluating the kernel as k(z_m, z_m\u2032) and k(x_n, z_m) respectively. The KL divergence term of the ELBO is analytically tractable, whilst the expectation term can be computed using one-dimensional quadrature. The form of the ELBO means that stochastic optimisation using minibatches is applicable. A full discussion of the methodology is given by Matthews [19]. 
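As a concrete illustration, the predictive marginals (6)\u2013(8) can be computed in a few lines of numpy. This is a minimal sketch under our own naming and kernel choices (an RBF kernel and a small jitter term), not the paper's implementation:

```python
import numpy as np

def rbf(A, B, variance=1.0, lengthscale=1.0):
    # Squared-exponential kernel matrix between rows of A and rows of B.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq / lengthscale ** 2)

def predictive_marginals(X, Z, m, S, k=rbf):
    """q(f(x_i)) = N(mu_i, s2_i) as in eqs. (6)-(8):
    mu = Kfu Kuu^-1 m,  s2 = diag(Kff) + diag(Kfu Kuu^-1 (S - Kuu) Kuu^-1 Kuf)."""
    Kuu = k(Z, Z) + 1e-8 * np.eye(len(Z))  # jitter for numerical stability
    Kfu = k(X, Z)
    A = np.linalg.solve(Kuu, Kfu.T).T      # A = Kfu Kuu^-1
    mu = A @ m
    s2 = np.diag(k(X, X)) + ((A @ (S - Kuu)) * A).sum(-1)
    return mu, s2
```

Setting m = 0 and S = Kuu recovers the prior marginals, which is a useful sanity check when swapping in a new kernel.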
We optimise the ELBO instead of the marginal likelihood to find the hyperparameters.\n\n2.2 Inter-domain variational GPs\n\nInter-domain Gaussian processes [25] work by replacing the variables u, which we have above assumed to be observations of the function at the inducing inputs Z, with more complicated variables made by some linear operator on the function. Using linear operators ensures that the inducing variables u are still jointly Gaussian with the other points on the GP. Implementing inter-domain inducing variables can therefore be a drop-in replacement to inducing points, requiring only that the appropriate (cross-)covariances Kfu and Kuu are used.\n\nThe key advantage of the inter-domain approach is that the approximate posterior mean\u2019s (7) effective basis functions ku(\u00b7) can be manipulated by the linear operator which constructs u. This can make the approximation more flexible, or give other computational benefits. For example, Hensman et al. [17] used the Fourier transform to construct u such that the Kuu matrix becomes easier to invert.\n\nInter-domain inducing variables are usually constructed using a weighted integral of the GP:\n\nu_m = \u222b \u03c6(x; z_m) f(x) dx ,   (9)\n\nwhere the weighting function \u03c6 depends on some parameters z_m. The covariance between the inducing variable u_m and a point on the function is then\n\ncov(u_m, f(x_n)) = k(z_m, x_n) = \u222b \u03c6(x; z_m) k(x, x_n) dx ,   (10)\n\nand the covariance between two inducing variables is\n\ncov(u_m, u_m\u2032) = k(z_m, z_m\u2032) = \u222b\u222b \u03c6(x; z_m) \u03c6(x\u2032; z_m\u2032) k(x, x\u2032) dx dx\u2032 .   (11)\n\nUsing inter-domain inducing variables in the variational framework is straightforward if the above integrals are tractable. 
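For intuition, equation (10) can be checked numerically for a one-dimensional GP. The Gaussian window \u03c6 and RBF kernel below are illustrative choices of ours (this particular pair happens to admit a closed-form integral), not a construction from the paper:

```python
import numpy as np

def k_rbf(x, y, ell=1.0):
    # 1-D squared-exponential kernel.
    return np.exp(-0.5 * (x - y) ** 2 / ell ** 2)

def phi(x, z, a=0.5):
    # Gaussian weighting function centred at the inducing input z.
    return np.exp(-0.5 * (x - z) ** 2 / a ** 2) / np.sqrt(2 * np.pi * a ** 2)

def cross_cov(z, xn, a=0.5, ell=1.0):
    # eq. (10): cov(u_m, f(x_n)) = \int phi(x; z_m) k(x, x_n) dx, by quadrature.
    grid = np.linspace(-10.0, 10.0, 4001)
    vals = phi(grid, z, a) * k_rbf(grid, xn, ell)
    return vals.sum() * (grid[1] - grid[0])

def cross_cov_exact(z, xn, a=0.5, ell=1.0):
    # Closed form for this Gaussian window / RBF pair (a Gaussian convolution).
    return ell / np.sqrt(a ** 2 + ell ** 2) * np.exp(
        -0.5 * (z - xn) ** 2 / (a ** 2 + ell ** 2))
```

The quadrature and the closed form agree to high precision; the resulting inter-domain covariance is a wider, smoother function of z \u2212 x_n than the original kernel, as expected from the averaging.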
The results are substituted for the kernel evaluations in equations (7) and (8).\n\nOur proposed method will be an inter-domain approximation in the sense that the inducing input space is different from the input space of the kernel. However, instead of relying on an integral transformation of the GP, we construct the inducing variables u alongside the new kernel such that the effective basis functions contain a convolution operation.\n\n2.3 Additive GPs\n\nWe would like to draw attention to previously studied additive models [26, 27], in order to highlight the similarity with the convolutional kernels we will introduce later. Additive models construct a prior GP as a sum of functions over subsets of the input dimensions, resulting in a kernel with the same additive structure. For example, summing over each input dimension i, we get\n\nf(x) = \u2211_i f_i(x[i])  =\u21d2  k(x, x\u2032) = \u2211_i k_i(x[i], x\u2032[i]) .   (12)\n\nThis kernel exhibits some non-local generalisation, as the relative function values along one dimension will be the same regardless of the input along other dimensions. In practice, this specific additive model is rather too restrictive to fit data well, since it assumes that all variables affect the response y independently. 
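The additive construction in equation (12) is easy to state in code. A small sketch, with a shared 1-D RBF component as an assumed choice:

```python
import numpy as np

def additive_kernel(x, xp, ki):
    # eq. (12): k(x, x') = sum_i k_i(x[i], x'[i]), one 1-D kernel per dimension.
    return sum(ki(x[i], xp[i]) for i in range(len(x)))

k1 = lambda a, b: np.exp(-0.5 * (a - b) ** 2)  # shared 1-D RBF component
```

Changing one input dimension only changes that dimension's term, which is exactly the non-local generalisation mentioned above: the contributions of all other dimensions are unaffected.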
At the other extreme, the popular squared exponential kernel allows interactions between all dimensions, but this turns out to be not restrictive enough: for high-dimensional problems we need to impose some restriction on the form of the function.\n\nIn this work, we build an additive kernel inspired by the convolution operator found in convnets. The same function is applied to patches from the input, which allows adjacent pixels to interact, but imposes an additive structure otherwise.\n\n3 Convolutional Gaussian Processes\n\nWe begin by constructing the exact convolutional Gaussian process model, highlighting its connections to existing neural network models, and challenges in performing inference.\n\nConvolutional kernel construction Our aim is to construct a GP prior on functions from images of size D = W \u00d7 H to real-valued responses: f : R^D \u2192 R. We start with a patch-response function, g : R^E \u2192 R, mapping from patches of size E. We use a stride of 1 to extract all patches, so for patches of size E = w \u00d7 h, we get a total of P = (W \u2212 w + 1) \u00d7 (H \u2212 h + 1) patches. We can start by simply making the overall function f the sum of all patch responses. If g(\u00b7) is given a GP prior, a GP prior will also be induced on f(\u00b7):\n\nf(x) = \u2211_p g(x[p]) ,   (13)\ng \u223c GP(0, kg(z, z\u2032))  =\u21d2  f \u223c GP( 0, \u2211_{p=1}^P \u2211_{p\u2032=1}^P kg(x[p], x\u2032[p\u2032]) ) ,   (14)\n\nwhere x[p] indicates the pth patch of the image x. This construction is reminiscent of the additive models discussed earlier, since a function is applied to subsets of the input. However, in this case, the same function g(\u00b7) is applied to all input subsets. 
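A direct (and deliberately naive) sketch of equations (13)\u2013(14): extract all stride-1 patches and sum an RBF patch kernel over all patch pairs. The names and the choice of RBF k_g are ours; the O(P^2) cost of this evaluation is exactly the bottleneck discussed below:

```python
import numpy as np

def patches(x, w=3, h=3):
    # All stride-1 patches of a W x H image, flattened to rows of length w*h.
    W, H = x.shape
    return np.array([x[i:i + w, j:j + h].ravel()
                     for i in range(W - w + 1)
                     for j in range(H - h + 1)])

def k_conv(x, xp, w=3, h=3, ell=1.0):
    # eq. (14): k_f(x, x') = sum_{p,p'} k_g(x[p], x'[p']) with an RBF k_g.
    P, Pp = patches(x, w, h), patches(xp, w, h)
    sq = ((P[:, None, :] - Pp[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ell ** 2).sum()
```

For a 28 \u00d7 28 MNIST image with 3 \u00d7 3 patches this sums over P^2 = 676^2 patch pairs per kernel evaluation, which is why the inducing patch approximation of section 4 is needed.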
This allows all patches in the image to inform the value of the patch-response function, regardless of their location.\n\nComparison to convnets This approach is similar in spirit to convnets. Both methods start with a function that is applied to each patch. In the construction above, we introduce a single patch-response function g(\u00b7) that is non-linear and nonparametric. Convnets, on the other hand, rely on many linear filters, followed by a non-linearity. The flexibility of a single convolutional layer is controlled by the number of filters, while depth is important in order to allow for enough non-linearity. In our case, adding more non-linear filters to the construction of f(\u00b7) does not increase the capacity to learn. The patch responses of the multiple filters would be summed, resulting in simply a summed kernel for the prior over g.\n\nComputational issues Similar kernels have been proposed in various forms [28, 29], but have never been applied directly in GPs, probably due to the prohibitive costs. Direct implementation of a GP using kf would be infeasible not only due to the usual cubic cost w.r.t. the number of data points, but also due to it requiring P^2 evaluations of kg per element of Kff. For MNIST with patches of size 5, P^2 \u2248 3.3 \u00d7 10^5, resulting in the kernel evaluations becoming a significant bottleneck. Sparse inducing point methods require M^2 + NM kernel evaluations of kf. As an illustration, the Kuu matrix for 750 inducing points (which we use in our experiments) would require \u223c 700 GB of memory for backpropagation. Luckily, this can largely be avoided.\n\n4 Inducing patch approximations\n\nIn the next few sections, we will introduce several variants of the convolutional Gaussian process, and illustrate their properties using toy and real datasets. 
Our main contribution is showing that convolutional structure can be embedded in kernels, and that they can be used within the framework of nonparametric Gaussian process approximations. We do so by constructing the kernel in tandem with a suitable domain in which to place the inducing variables. Implementation2 requires minimal changes to existing implementations of sparse variational GP inference, and can leverage GPU implementations of convolution operations (see appendix). In the appendix we also describe how the same inference method can be applied to kernels with general invariances.\n\n2Ours can be found on https://github.com/markvdw/convgp, together with code for replicating the experiments, and trained models. It is based on GPflow [30], allowing utilisation of GPUs.\n\n4.1 Translation invariant convolutional GP\n\n(a) Rectangles dataset. (b) MNIST 0-vs-1 dataset.\nFigure 1: The optimised inducing patches for the translation invariant kernel. The inducing patches are sorted by the value of their corresponding inducing output, illustrating the evidence each patch has in favour of a class.\n\nHere we introduce the simplest version of our method. We start with the construction from section 3, with an RBF kernel for kg. In order to obtain a tractable method, we want to approximate the true posterior using a small set of inducing points. The main idea is to place these inducing points in the input space of patches, rather than images. This corresponds to using inter-domain inducing points. 
In order to use this approximation we simply need to find the appropriate inter-domain (cross-)covariances Kuu and Kfu, which are easily found from the construction of the convolutional kernel in equation 14:\n\nkfu(x, z) = Eg[f(x) g(z)] = Eg[ \u2211_p g(x[p]) g(z) ] = \u2211_p kg(x[p], z) ,   (15)\nkuu(z, z\u2032) = Eg[g(z) g(z\u2032)] = kg(z, z\u2032) .   (16)\n\nThis improves on the computation from the standard inducing point method, since only covariances between the image patches and inducing patches are needed, allowing Kfu to be calculated with NMP instead of NMP^2 kernel evaluations. Since Kuu now only requires the covariances between inducing patches, its cost is M^2 instead of M^2 P^2 evaluations. However, evaluating diag[Kff] does still require NP^2 evaluations, although N can be small when using minibatch optimisation. This brings the cost of computing the kernel matrices down significantly compared to the O(NM^2) cost of the calculation of the ELBO.\n\nIn order to highlight the capabilities of the new kernel, we now consider two toy tasks: classifying rectangles and distinguishing zeros from ones in MNIST.\n\nToy demo: rectangles The rectangles dataset is an artificial dataset containing 1200 images of size 28 \u00d7 28. Each image contains the outline of a randomly generated rectangle, and is labelled according to whether the rectangle has larger width or length. Despite its simplicity, the dataset is tricky for standard kernel-based methods, including Gaussian processes, because of the high dimensionality of the input, and the strong dependence of the label on multiple pixel locations.\n\nTo tackle the rectangles dataset with the convolutional GP, we used a patch size of 3 \u00d7 3 and 16 inducing points initialised with uniform random noise. 
We optimised using Adam [31] (0.01 learning rate & 100 data points per minibatch) and obtained 1.4% error and a negative log predictive probability (nlpp) of 0.055 on the test set. For comparison, an RBF kernel with 1200 optimally placed inducing points, optimised with BFGS, gave 5.0% error and an nlpp of 0.258. Our model is both better in terms of performance, and uses fewer inducing points. The model works because it is able to recognise and count vertical and horizontal bars in the patches. The inducing points quickly move to recognise the horizontal and vertical lines in the images \u2013 see Figure 1a.\n\nIllustration: Zeros vs ones MNIST We perform a similar experiment for classifying MNIST 0 and 1 digits. This time, we initialise using patches from the training data and use 50 inducing features, shown in figure 1b. Features in the top left are in favour of classifying a zero, and tend to be diagonal or bent lines, while features for ones tend to be blank space or vertical lines. We get 0.3% error.\n\nFull MNIST Next, we turn to the full multi-class MNIST dataset. Our setup follows Hensman et al. [5], with 10 independent latent GPs using the same convolutional kernel, and constraining q(u) to a Gaussian (see section 2). It seems that this translation invariant kernel is too restrictive for this task, since the error rate converges at around 2.1%, compared to 1.9% for the RBF kernel.\n\n4.2 Weighted convolutional kernels\n\nWe saw in the previous section that although the translation invariant kernel excelled at the rectangles task, it under-performed compared to the RBF on MNIST. Full translation invariance is too strong a constraint, which makes intuitive sense for image classification, as the same feature in different locations of the image can imply different classes. 
This can be remedied without leaving the family of Gaussian processes by relaxing the constraint of requiring each patch to give the same contribution, regardless of its position in the image. We do so by introducing a weight for each patch. Denoting again the underlying patch-based GP as g, the image-based GP f is given by\n\nf(x) = \u2211_p wp g(x[p]) .   (17)\n\nThe weights {wp}_{p=1}^P adjust the relative importance of the response for each location in the image. Only kf and kfu differ from the invariant case, and can be found to be:\n\nkf(x, x\u2032) = \u2211_{pq} wp wq kg(x[p], x\u2032[q]) ,   (18)\nkfu(x, z) = \u2211_p wp kg(x[p], z) .   (19)\n\nThe patch weights w \u2208 R^P are now kernel hyperparameters, and we optimise them with respect to the ELBO in the same fashion as the underlying parameters of the kernel kg. This introduces P hyperparameters into the kernel \u2013 slightly fewer than the number of input pixels, which is how many hyperparameters an automatic relevance determination kernel would have.\n\nToy demo: rectangles The errors in the previous section were caused by rectangles along the edge of the image, which contained bars which only contribute once to the classification score. Bars in the centre contribute to multiple patches. The weighting allows some up-weighting of patches along the edge. This results in near-perfect classification, with no classification errors and an nlpp of 0.005.\n\nFull MNIST The weighting causes a significant reduction in error over the translation invariant and RBF kernels (table 1 & figure 2). The weighted convolutional kernel obtains 1.22% error \u2013 a significant improvement over 1.9% for the RBF kernel [5]. Krauth et al. 
[32] report 1.55% error using an RBF kernel, but using a leave-one-out objective for finding the hyperparameters.\n\n4.3 Does convolution capture everything?\n\nAs discussed earlier, the additive nature of the convolutional kernel places constraints on the possible functions in the prior. While these constraints have been shown to be useful for classifying MNIST, we lose the guarantee (that e.g. the RBF provides) of being able to model any continuous function arbitrarily well in the large-data limit. This is because convolutional kernels are not universal [33, 34] in the image input space, despite being nonparametric. This places convolutional kernels in a middle ground between parametric and universal kernels (see the appendix for a discussion). A kernel that is universal and has some amount of convolutional structure can be obtained by summing an RBF component: k(x, x\u2032) = krbf(x, x\u2032) + kconv(x, x\u2032). Equivalently, the GP is constructed by the sum f(x) = fconv(x) + frbf(x). This allows the universal RBF to model any residuals that the convolutional structure cannot explain. We use the marginal likelihood estimate to automatically weigh how much of the process should be explained by each of the components, in the same way as is done in other additive models [27, 35].\n\nInference in such a model is straightforward under the usual inducing point framework \u2013 it only requires evaluating the sum of kernels. The case considered here is more complicated since we want the inducing inputs for the RBF to lie in the space of images, while we want to use inducing patches for the convolutional kernel. 
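The summed kernel k(x, x\u2032) = krbf(x, x\u2032) + kconv(x, x\u2032) described above can be sketched directly on raw images. This is our own minimal implementation; the lengthscales and patch size are arbitrary illustrative choices:

```python
import numpy as np

def k_rbf_image(x, xp, ell=10.0):
    # The universal "fully connected" component, on the flattened image.
    return np.exp(-0.5 * ((x - xp) ** 2).sum() / ell ** 2)

def k_conv_image(x, xp, w=2, h=2, ell=1.0):
    # Convolutional component: an RBF patch kernel summed over all patch pairs.
    def pats(im):
        W, H = im.shape
        return np.array([im[i:i + w, j:j + h].ravel()
                         for i in range(W - w + 1)
                         for j in range(H - h + 1)])
    P, Pp = pats(x), pats(xp)
    sq = ((P[:, None, :] - Pp[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ell ** 2).sum()

def k_sum(x, xp):
    # k = k_rbf + k_conv; equivalently f = f_conv + f_rbf with independent priors.
    return k_rbf_image(x, xp) + k_conv_image(x, xp)
```

In the full model each component also carries its own variance hyperparameter, which is what the marginal likelihood uses to weigh the two parts against each other.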
This forces us to use a slightly different form for the approximating GP, representing the inducing inputs and outputs separately, as\n\n[uconv; urbf] \u223c N( [\u00b5conv; \u00b5rbf], S ) ,   (20)\nf(\u00b7) | u = fconv(\u00b7) | uconv + frbf(\u00b7) | urbf .   (21)\n\nThe variational lower bound changes only through the equations (7) and (8), which must now contain contributions of the two component Gaussian processes. If covariances in the posterior between fconv and frbf are to be allowed, S must be a full-rank 2M \u00d7 2M matrix. A mean-field approximation can be chosen as well, in which case S can be M \u00d7 M block-diagonal, saving some parameters. Note that regardless of which approach is chosen, the largest matrix to be inverted is still M \u00d7 M, as uconv and urbf are independent in the prior (see the appendix for more details).\n\nFull MNIST By adding an RBF component, we indeed get an extra reduction in error and nlpp from 1.22% to 1.17% and 0.048 to 0.039 respectively (table 1 & figure 2). 
The variances for the convolutional and RBF kernels are 14.3 and 0.011 respectively, showing that the convolutional kernel explains most of the variance in the data.\n\nFigure 2: Test error (left) and negative log predictive probability (nlpp, right) for MNIST, using RBF (blue), translation invariant convolutional (orange), weighted convolutional (green) and weighted convolutional + RBF (red) kernels.\n\nKernel           M    Error (%)  NLPP\nInvariant        750  2.08       0.077\nRBF              750  1.90       0.068\nWeighted         750  1.22       0.048\nWeighted + RBF   750  1.17       0.039\n\nTable 1: Final results for MNIST.\n\n4.4 Convolutional kernels for colour images\n\nOur final variants of the convolutional kernel handle images with multiple colour channels. The addition of colour presents an interesting modelling challenge, as the input dimensionality increases significantly, with a large amount of redundant information. As a baseline, the weighted convolutional kernel from section 4.2 can be used by taking all patches from each colour channel together, resulting in C times more patches, where C is the number of colour channels. This kernel can only account for linear interactions between colour channels through the weights, and is also constrained to give the same patch response regardless of the colour channel. A step up in flexibility would be to define g(\u00b7) to take a w \u00d7 h \u00d7 C patch with all C colour channels. This trades off increasing the dimensionality of the patch-response function input with allowing it to learn non-linear interactions between the colour channels. We call this the colour-patch variant. 
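The dimensionality trade-off between the baseline and the colour-patch variant is visible in the patch sets themselves. A small sketch (function names are our own):

```python
import numpy as np

def channel_patches(x, w=5, h=5):
    # Baseline "weighted" variant: patches taken from each channel separately,
    # giving C * P patches of dimension w * h.
    W, H, C = x.shape
    return np.array([x[i:i + w, j:j + h, c].ravel()
                     for c in range(C)
                     for i in range(W - w + 1)
                     for j in range(H - h + 1)])

def colour_patches(x, w=5, h=5):
    # Colour-patch variant: one patch spans all channels,
    # giving P patches of dimension w * h * C.
    W, H, C = x.shape
    return np.array([x[i:i + w, j:j + h, :].ravel()
                     for i in range(W - w + 1)
                     for j in range(H - h + 1)])
```

For a 32 \u00d7 32 \u00d7 3 CIFAR-10 image with 5 \u00d7 5 patches, P = 28^2 = 784, so the baseline sees 2352 patches of dimension 25, while the colour-patch variant sees 784 patches of dimension 75.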
A middle ground that does not increase the dimensionality as much, is to use a different patch-response function gc(\u00b7) for each colour channel. We will refer to this as the multi-channel convolutional kernel. We construct the overall function f as\n\nf(x) = \u2211_{p=1}^P \u2211_{c=1}^C wpc gc(x[pc]) .   (22)\n\nFor this variant, inference becomes similar to section 4.3, although for a different reason. While all gc(\u00b7)s can use the same inducing patch inputs, we need access to each gc(x[pc]) separately in order to fully specify f(x). This causes us to require separate inducing outputs for each gc. In our approximation, we share the inducing inputs, while, as was done in section 4.3, representing the inducing outputs separately. The equations for f(\u00b7) | u are changed only through the matrices Kfu and Kuu being N \u00d7 MC and MC \u00d7 MC respectively. Given that the gc(\u00b7) are independent in the prior, and the inducing inputs are constrained to be the same, Kuu is a block-diagonal repetition of kg(zm, zm\u2032). All the elements of Kfu are given by\n\nkfgc(x, z) = E_{{gc}_{c=1}^C}[ ( \u2211_p wpc gc(x[pc]) ) gc(z) ] = \u2211_p wpc kg(x[pc], z) .   (23)\n\nAs in section 4.3, we have the choice to represent a full CM \u00d7 CM covariance matrix for all inducing variables u, or go for a mean-field approximation requiring only C M \u00d7 M matrices. Again, both versions require no expensive matrix operations larger than M \u00d7 M (see appendix).\n\nFinally, a simplification can be made in order to avoid representing C patch-response functions. If the weighting of each of the colour channels is constant w.r.t. the patch location (i.e. wpc = wp wc), the model is equivalent to using a patch-response function with an additive kernel:\n\nf(x) = \u2211_p wp \u2211_c wc gc(x[pc]) = \u2211_p wp g\u0303(x[p]) ,   (24)\ng\u0303(\u00b7) \u223c GP( 0, \u2211_c wc^2 kc(\u00b7, \u00b7) ) .   (25)\n\nCIFAR-10 We conclude the experiments by an investigation of CIFAR-10 [36], where 32 \u00d7 32 sized RGB images are to be classified. We use a similar setup to the previous MNIST experiments, by using 5 \u00d7 5 patches. Again, all latent functions share the same kernel for the prior, including the patch weights. We compare an RBF kernel to 4 variants of the convolutional kernel: the baseline \u201cweighted\u201d, the colour-patch, the colour-patch variant with additive structure (equation 24), and the multi-channel with mean-field inference. All models use 1000 inducing inputs and are trained using Adam. Due to memory constraints on the GPU, a minibatch size of 40 had to be used for the weighted, additive and multi-channel models.\n\nTest errors and nlpps during training are shown in figure 3. Any convolutional structure significantly improves classification performance, with colour interactions seeming particularly important, as the best performing model is the multi-channel GP. The final error rate of the multi-channel kernel was 35.4%, compared to 48.6% for the RBF kernel. While we acknowledge that this is far from state of the art using deep nets, it is a significant improvement over existing Gaussian process models, including the 44.95% error reported by Krauth et al. [32], where an RBF kernel was used together with their leave-one-out objective for the hyperparameters. 
This improvement is orthogonal to the use of a new kernel.\n\n5 Conclusion\n\nWe introduced a method for efficiently using convolutional structure in Gaussian processes, akin to how it has been used in neural nets. Our main contribution is showing how placing the inducing inputs in the space of patches gives rise to a natural inter-domain approximation that fits in sparse GP approximation frameworks. We discuss several variations of convolutional kernels and show how they can be used to push the performance of Gaussian process models on image datasets. Additionally, we show how the marginal likelihood can be used to assess to what extent a dataset can be explained with only convolutional structure. We show that convolutional structure is not sufficient, and that performance can be improved by adding a small amount of \u201cfully connected\u201d (RBF) structure. The ability to do this, and automatically tune the hyperparameters, is a real strength of Gaussian processes. It would be great if this ability could be incorporated in larger or deeper models as well.\n\nFigure 3: Test error (left) and nlpp (right) for CIFAR-10, using RBF (blue), baseline weighted convolutional (orange), full-colour weighted convolutional (green), additive (red), and multi-channel (purple).\n\nAcknowledgements\n\nCER gratefully acknowledges support from EPSRC grant EP/J012300. MvdW is generously supported by a Qualcomm Innovation Fellowship.\n\nReferences\n\n[1] Carl Edward Rasmussen and Christopher K.I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n\n[2] Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. 
In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, pages 567–574, 2009.

[3] James Hensman, Nicolò Fusi, and Neil D. Lawrence. Gaussian processes for big data. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI), pages 282–290, 2013.

[4] Alexander G. de G. Matthews, James Hensman, Richard E. Turner, and Zoubin Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 231–238, 2016.

[5] James Hensman, Alexander G. de G. Matthews, Maurizio Filippone, and Zoubin Ghahramani. MCMC for variationally sparse Gaussian processes. In Advances in Neural Information Processing Systems 28, pages 1639–1647, 2015.

[6] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, January 2009. ISSN 1935-8237.

[7] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105, 2012.

[9] Andrew G. Wilson, Zhiting Hu, Ruslan R. Salakhutdinov, and Eric P. Xing. Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, pages 2586–2594, 2016.

[10] Roberto Calandra, Jan Peters, Carl Edward Rasmussen, and Marc Peter Deisenroth. Manifold Gaussian processes for regression. In 2016 International Joint Conference on Neural Networks (IJCNN), pages 3338–3345, 2016.

[11] Radford M. Neal. Bayesian Learning for Neural Networks, volume 118.
Springer, 1996.

[12] Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.

[13] Matthias Seeger, Christopher K. I. Williams, and Neil D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, 2003.

[14] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18, pages 1257–1264, 2005.

[15] Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 1775–1784, 2015.

[16] Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen, and Aníbal R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11:1865–1881, 2010.

[17] James Hensman, Nicolas Durrande, and Arno Solin. Variational Fourier features for Gaussian processes. arXiv preprint arXiv:1611.06740, 2016.

[18] Manfred Opper and Cédric Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21(3):786–792, 2009.

[19] Alexander G. de G. Matthews. Scalable Gaussian Process Inference Using Variational Methods. PhD thesis, University of Cambridge, Cambridge, UK, 2016. Available at http://mlg.eng.cam.ac.uk/matthews/thesis.pdf.

[20] Daniel Hernández-Lobato and José Miguel Hernández-Lobato. Scalable Gaussian process classification via expectation propagation. In Artificial Intelligence and Statistics, pages 168–176, 2016.

[21] Thang D. Bui, Josiah Yan, and Richard E. Turner.
A unifying framework for sparse Gaussian process approximation using power expectation propagation. arXiv preprint arXiv:1605.07066, May 2016.

[22] Carlos Villacampa-Calvo and Daniel Hernández-Lobato. Scalable multi-class Gaussian process classification using expectation propagation. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3550–3559, 2017.

[23] Matthias Stephan Bauer, Mark van der Wilk, and Carl Edward Rasmussen. Understanding probabilistic sparse Gaussian process approximations. In Advances in Neural Information Processing Systems, 2016.

[24] James Hensman, Alexander G. de G. Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 351–360, 2015.

[25] Aníbal Figueiras-Vidal and Miguel Lázaro-Gredilla. Inter-domain Gaussian processes for sparse inference using inducing features. In Advances in Neural Information Processing Systems 22, pages 1087–1095. Curran Associates, Inc., 2009.

[26] Nicolas Durrande, David Ginsbourger, and Olivier Roustant. Additive covariance kernels for high-dimensional Gaussian process modeling. In Annales de la Faculté de Sciences de Toulouse, volume 21, pages p–481, 2012.

[27] David K. Duvenaud, Hannes Nickisch, and Carl E. Rasmussen. Additive Gaussian processes. In Advances in Neural Information Processing Systems, pages 226–234, 2011.

[28] Julien Mairal, Piotr Koniusz, Zaid Harchaoui, and Cordelia Schmid. Convolutional kernel networks. In Advances in Neural Information Processing Systems 27, pages 2627–2635, 2014.

[29] Gaurav Pandey and Ambedkar Dukkipati. Learning by stretching deep networks. In Tony Jebara and Eric P.
Xing, editors, Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1719–1727. JMLR Workshop and Conference Proceedings, 2014.

[30] Alexander G. de G. Matthews, Mark van der Wilk, Tom Nickson, Keisuke Fujii, Alexis Boukouvalas, Pablo León-Villagrá, Zoubin Ghahramani, and James Hensman. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40):1–6, 2017.

[31] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[32] Karl Krauth, Edwin V. Bonilla, Kurt Cutajar, and Maurizio Filippone. AutoGP: Exploring the capabilities and limitations of Gaussian process models, 2016.

[33] Ingo Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.

[34] Bharath K. Sriperumbudur, Kenji Fukumizu, and Gert R. G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12:2389–2410, July 2011.

[35] David K. Duvenaud, James R. Lloyd, Roger B. Grosse, Joshua B. Tenenbaum, and Zoubin Ghahramani. Structure discovery in nonparametric regression through compositional kernel search. In ICML (3), pages 1166–1174, 2013.

[36] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. URL http://www.cs.toronto.edu/~kriz/cifar.html.