{"title": "Efficient High Dimensional Bayesian Optimization with Additivity and Quadrature Fourier Features", "book": "Advances in Neural Information Processing Systems", "page_first": 9005, "page_last": 9016, "abstract": "We develop an efficient and provably no-regret Bayesian optimization (BO) algorithm for optimization of black-box functions in high dimensions. We assume a generalized additive model with possibly overlapping variable groups. When the groups do not overlap, we are able to provide the first provably no-regret \\emph{polynomial time} (in the number of evaluations of the acquisition function) algorithm for solving high dimensional BO. To make the optimization efficient and feasible, we introduce a novel deterministic Fourier Features approximation based on numerical integration with detailed analysis for the squared exponential kernel. The error of this approximation decreases \\emph{exponentially} with the number of features, and allows for a precise approximation of both posterior mean and variance. In addition, the kernel matrix inversion improves in its complexity from cubic to essentially linear in the number of data points measured in basic arithmetic operations.", "full_text": "Ef\ufb01cient High Dimensional Bayesian Optimization\nwith Additivity and Quadrature Fourier Features\n\nDepartment of Computer Science\n\nDepartment of Computer Science\n\nAndreas Krause\n\nETH Zurich, Switzerland\nkrausea@inf.ethz.ch\n\nMojm\u00edr Mutn\u00fd\n\nETH Zurich, Switzerland\n\nmojmir.mutny@inf.ethz.ch\n\nAbstract\n\nWe develop an ef\ufb01cient and provably no-regret Bayesian optimization (BO) algo-\nrithm for optimization of black-box functions in high dimensions. We assume a\ngeneralized additive model with possibly overlapping variable groups. When the\ngroups do not overlap, we are able to provide the \ufb01rst provably no-regret polyno-\nmial time (in the number of evaluations of the acquisition function) algorithm for\nsolving high dimensional BO. 
To make the optimization efficient and feasible, we introduce a novel deterministic Fourier Features approximation based on numerical integration with detailed analysis for the squared exponential kernel. The error of this approximation decreases exponentially with the number of features, and allows for a precise approximation of both posterior mean and variance. In addition, the kernel matrix inversion improves in its complexity from cubic to essentially linear in the number of data points measured in basic arithmetic operations.\n\n1 Introduction\n\nBayesian Optimization (BO) is a versatile method for global optimization of a black-box function using noisy point-wise observations. BO has been employed in selection of chemical compounds [21], online marketing [43], reinforcement learning problems [15, 29], and in the search for hyperparameters of machine learning algorithms [25]. BO requires a probabilistic model that reliably models the uncertainty in the unexplored part of the domain of the black-box function. This model is used to define an acquisition function whose maximum determines the next sequential query of the black-box function. A popular choice of probabilistic model is a Gaussian process (GP), a generalization of a Gaussian random vector to the space of functions.\n\nBO is very successful when applied to functions of low dimension. However, already problems with 5 or more dimensions can be challenging for general BO if they need to be optimized efficiently and to a high accuracy. Practical high dimensional BO with GPs usually incorporates an assumption on the covariance structure of the GP, or on the black-box function. In this work, we focus on BO with additive GPs [13], and generalized additive GPs [39] with possibly overlapping variable groups allowing cross-group interference. Even with the additive model assumption, BO in high dimensions remains a daunting task. 
There are two main problems associated with high dimensional BO with generalized additive GPs, namely, optimization of the acquisition function, and efficient handling of many data points - large-scale BO.\n\nTo alleviate the two problems, using a generalized additive model assumption and a popular acquisition function - Thompson sampling [50] - we design efficient no-regret algorithms for solving high dimensional BO problems. Thompson sampling has an acquisition function which leads to a natural block coordinate decomposition into the variable groups when used with additive models without overlapping groups, which reduces the complexity of the acquisition function. In fact, with this assumption, we show that the number of evaluations of the acquisition function is polynomial in the number of points queried during the BO process. The assembly of the acquisition function involves an inversion of the kernel matrix and hence scales cubically with the number of data points T.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nFigure 1: A GP fitted on noisy observations of g(x) with T = 1024 data points. One-σ confidence bounds are provided in the shaded regions. The parameter m denotes the size of the Fourier basis. RFF cannot produce reliable confidence bounds - variance starvation. On the other hand, QFF do not have this problem, and provide an accurate approximation even with a much smaller basis size. The true and approximated confidence intervals intersect exactly in the example above. The example comes from [53].\n\nTo ensure efficient and scalable optimization up to a high degree of accuracy without spiraling computational cost, we devise a high fidelity approximation scheme based on Fourier Features and methods from numerical integration. We denote this approximation Quadrature Fourier Features (QFF), in contrast to Random Fourier Features (RFF) [36]. 
This scheme approximates a stationary kernel by a linear kernel of a fixed dimensionality in a particularly transformed space. For ease of exposition we focus our analysis on the squared exponential kernel only, but the methods extend to a broader class of kernels.\n\nThe approximation scheme allows us to represent sample paths of a GP in closed form, and hence optimize them efficiently to high accuracy. Moreover, the uniform approximation error of QFF decreases exponentially with the size of the linear basis, in contrast to standard RFF, where it decreases with the inverse square root of the basis dimension. However, QFF scale unfavorably with the effective dimensionality of the model, making them unsuitable for arbitrary high dimensional kernel approximation. Their strengths manifest on problems with a low dimension or a low effective dimension. In the context of generalized additive models, the effective dimension is the dimension of the largest variable group, which is usually small.\n\nPrevious Works High dimensional BO with GPs has previously been considered with assumptions either on the covariance or on the black-box function. Namely, [56, 12, 8] assume a low dimensional active subspace of the black-box function, and [39, 14, 24, 54] assume (generalized) additive kernels. In [14] the authors propose a heuristic to identify the additive structure. However, satisfactory theoretical certificates on cumulative regret or sufficiently practical algorithms with acquisition functions that can be efficiently optimized are lacking. In addition, [9] derives high probability bounds on Thompson sampling with GPs in the frequentist setting and [41] in the Bayesian framework.\n\nTo alleviate the computational cost of kernel methods (and GPs), the machine learning community has devised various approximation schemes. 
Among the plethora of approximations, Nyström Features [31], Random Fourier Features (RFF) [36, 35, 48] or more generally Fourier Features [3, 18, 10], and sparse GPs (inducing point methods) [44, 28] stand out.\n\nInducing point methods are a rich and competitive class of algorithms [51, 59, 19]. Very recently, [34] extended KISS-GP [59] and showed very accurate posterior sampling with linear complexity (in the number of data points) applied to Bayesian optimization. They utilize the Toeplitz structure of covariance matrices and an iterative linear system solver. However, in contrast to ours, their method is not theoretically analyzed in either posterior moment convergence or cumulative regret.\n\nThe approach most closely related to ours is that of [10] and [52]. Both works use methods from numerical quadrature as well. The former proves exponential convergence for certain types of approximations without providing an explicit construction. The latter considers additive kernels of a different class. In [2], the authors consider an orthogonal direction to achieve the same cost of inversion as QFF from the perspective of linear algebra decompositions.\n\nKernel approximation in connection with BO usually focuses on resolving the unfavorable cubic cost of kernel inversion. In this context, approximation schemes for GPs such as RFF and Mondrian features [4] have been used in [55] and [53], respectively. However, [54] demonstrates an adversarial example where RFF cannot reliably reproduce the posterior variance - variance starvation. A similar conclusion is found in [34] working with the Max-Value Entropy Search of [55]. We reproduce this example in Figure 1 and show that QFF (even with a smaller basis set) do not suffer from this problem and reproduce the variance with high accuracy. 
More broadly, sparse GPs and Bayesian Neural Networks as possible approximations of kernels have been considered in the literature as heuristics for BO [30, 45, 46].\n\nContributions\n\n- We develop a novel approximation strategy for kernel methods and GPs - Quadrature Fourier Features (QFF). This approximation is uniform, and for the squared exponential kernel its error provably decreases exponentially with the number of features.\n- By introducing QFF, the computational cost of the kernel inversion for generalized additive models reduces from O(T^3) to O(T (log T)^2) measured in basic arithmetic operations, where T is the number of data points. This approximation allows the use of BO in large-scale settings and speeds up the sequential calculation significantly.\n- We prove that the Thompson sampling and GP-UCB [47] algorithms are no-regret when combined with the QFF approximation, and for the squared exponential kernel the bound is the same as without the QFF approximation up to logarithmic factors.\n- Using an additive kernel without overlapping groups and the Thompson sampling acquisition function, QFF allow us to formulate a practical and provably computationally efficient algorithm for high dimensional BO. This algorithm allows optimization of sample paths for Thompson sampling to an arbitrary precision without the need to iteratively sample from the posterior.\n- In the supplementary material we provide a general method to construct QFF for other stationary kernels.\n\n2 Generalized Additive Gaussian Processes and Thompson Sampling\n\nA Gaussian process (GP) is fully characterized by its domain D ⊆ R^d, its prior mean (assumed to be zero here), and its kernel function k : D × D → R. It is a stochastic process all of whose finite marginals are Gaussian; in particular, f(x) ∼ N(μ(x), σ(x)^2), where μ(x) is the mean and σ(x)^2 is the variance. 
The covariance structure of the stochastic process is governed by the kernel function k(x, y).\n\nGeneralized Additive GPs Generalized additive models [39] are a generalization of additive models [16] that decompose a function into a sum of functions g(j) defined over low-dimensional components. Namely,\n\ng(x) = Σ_{j=1}^G g(j)(x(j)), (1)\n\nwhere each x(j) belongs to a low dimensional subspace X(j) ⊆ D. With G, we always denote the number of these components. Additive models, in contrast to generalized additive models, imply that X(j) ∩ X(k) = ∅ if k ≠ j. In our work, we start with generalized additive models and specialize to additive models when needed.\n\nThe concept of additive models can be extended to Gaussian processes, where the stochastic process f is a sum of stochastic processes f = Σ_{j=1}^G fj, where each has a low dimensional indexing (dimensions) [13, 38]. With the additive assumption, the kernel and the mean function of a generalized additive GP decompose in the same fashion as the components fj. Namely, k(x, y) = Σ_{j=1}^G k(j)(x(j), y(j)) and μ(x) = Σ_{j=1}^G μ(j)(x(j)). This simplifies the GP, and we define the effective dimensionality of the model as the largest dimension among all additive groups, d̄ = max_{j∈[G]} dim(X(j)). Next, we explain how these methods can be exploited with BO.\n\nBO with Posterior sampling BO sequentially generates points where the black-box function g(x) should be queried. These points are maximizers of an acquisition function [7]. A popular class of stochastic acquisition functions without a generally tractable closed-form expression is Thompson sampling [50]. In Thompson sampling, a sample from the posterior GP is chosen as the acquisition function at each step.\n\nUsing the generalized additive assumption, a sample from the GP (f ∼ GP) decomposes as f(x) = Σ_{j=1}^G fj(x(j)). With the additive model assumption (no overlapping groups), the individual functions depend on their specific variable groups only, i.e., X(j). Consequently, fj(x(j)) can be optimized independently on a lower dimensional subspace. Due to this decomposition, Thompson sampling is a natural candidate for BO with additive models. However, the use of Thompson sampling in practice is limited by the computational problems associated with sampling from the posterior.\n\nThe maximum of a sample path In principle, a sample path from a GP can be optimized using three methods. The first, direct method, samples a path over the whole finite domain at once and finds the maximum. The standard way to sample on a discrete domain D is to perform a Cholesky decomposition of the covariance matrix, which costs O(|D|^3) basic arithmetic operations. Having a finely discretized domain D, this cost might be prohibitive, especially considering that |D| grows exponentially with the dimension. With the additive assumption (non-overlapping), the variable groups are independent, thus one could sequentially sample the GP only on X(j) ⊆ D, and condition on these observations while iterating over the groups. However, this method is sequential and requires re-computation of the posterior after each variable group has been sampled. We refer to this method as canonical Thompson sampling with additive models in our benchmarks.\n\nThe second option is to sample iteratively. Here, we sample a value of the stochastic process at a point and condition on it to sample the next one. With this approach we can optimize the acquisition function over a continuous domain. However, at every single iteration, a new posterior has to be recomputed, which can again be prohibitively slow. The third approach is to use a finite basis approximation. 
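The finite-basis route admits a short concrete illustration. The sketch below (our own code, not the authors'; `rff_features`, the lengthscale `gamma`, and all other names are our assumptions) builds a random Fourier feature map for the squared exponential kernel and shows that, once a weight vector θ is drawn, a prior sample path is simply the explicit, differentiable function f(x) = Φ(x)⊤θ:

```python
import numpy as np

def rff_features(X, m, gamma, rng):
    """Random Fourier feature map for the squared exponential kernel
    k(x, y) = exp(-||x - y||^2 / (2 * gamma^2)). One common construction."""
    d = X.shape[1]
    # Spectral samples omega ~ N(0, I / gamma^2), the Bochner measure of the SE kernel
    W = rng.normal(scale=1.0 / gamma, size=(m, d))
    Z = X @ W.T
    # Stacked cos/sin features, scaled so that Phi(x) . Phi(y) estimates k(x, y)
    return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(m)

rng = np.random.default_rng(0)
gamma = 0.5
X = rng.uniform(size=(5, 1))                       # five 1-d inputs
Phi = rff_features(X, m=2000, gamma=gamma, rng=rng)

# With a finite basis, one prior sample path is just f(x) = Phi(x) . theta
theta = rng.normal(size=Phi.shape[1])
f_sample = Phi @ theta                             # sample-path values at X, in closed form
```

Because θ is drawn once, the sample path can be evaluated (and differentiated) anywhere in the domain without re-conditioning the posterior, which is exactly what makes this third approach attractive for optimizing the acquisition function.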
Fourier Features provide such an approximation, and their use is the subject of Section 4, where we introduce a closed form expression for the sample paths and their derivatives.\n\n3 Fourier Features for Bayesian Optimization\n\nBayesian Optimization and Uniform Approximation BO requires that the probabilistic model is a reliable proxy for uncertainty. In order to have a method which can truly and faithfully explore the domain of the function, we need that the approximation to the uncertainty model is valid on the whole optimization domain. Consequently, one requires a uniform approximation guarantee. Such guarantees cannot be easily obtained by methods based on past observations such as Nyström features [57], or other adaptive weighting methods, unless the obtained data cover the whole domain. As the purpose of BO is to efficiently probe the black-box function, these methods are not compatible with the goal of BO.\n\nOne of the popular methods that uniformly approximate the kernel with theoretical guarantees is the Fourier Features method. This approach is applicable to any continuous stationary kernel. According to Bochner's theorem [40], any such kernel can be expressed as a Fourier integral of a dual function p(ω) in the frequency space. Approximating this integral in a suitable manner can provide a uniform approximation.\n\nDefinition 1 (Uniform Approximation). Let k : D × D → R be a stationary kernel defined on D ⊂ R^d; then the inner product Φ(x)⊤Φ(y) in R^m ε-uniformly approximates k if and only if\n\nsup_{x,y∈D} |k(x, y) − Φ(x)⊤Φ(y)| ≤ ε. (2)\n\nIn Definition 1, generally, ε has a functional dependence on m, the size of the approximating basis. For example, ε(m) = O(m^{−1/2}) for Random Fourier Features. Our analysis reveals that the error of the uniform approximation translates to the approximation guarantee on the posterior mean, the posterior variance, and the cumulative regret for common BO algorithms.\n\n3.1 General Fourier Features\n\nBochner's theorem states the existence of an integral representation for the kernel function, which can be subsequently approximated via a finite sum:\n\nk(x − y) = ∫_Ω (cos(ω⊤x), sin(ω⊤x))⊤ (cos(ω⊤y), sin(ω⊤y)) p(ω) dω ≈ Φ(x)⊤Φ(y), (3)\n\nwhere the equality is Bochner's theorem and the approximation defines the Fourier Features. The finite sum approximation is performed such that each term in the sum is a product of two analytically identical terms, each depending on either x or y. This finite sum, in effect, defines a linear kernel in a new space via the mapping Φ. One of the approximations satisfying these requirements is Monte Carlo sampling according to the distribution p(ω). This is the approximation used for the celebrated Random Fourier Features (RFF) [36, 35, 3].\n\nLinear kernels are desirable as they can be dealt with efficiently. They have a fixed dimensionality, and the inversion of the kernel matrix scales with the dimension of the space rather than the number of data points, as is demonstrated in the next paragraph.\n\nThe Posterior with Fourier Features We denote the dimension of the Fourier Feature mapping in (3) by m. Then the covariance in this approximating linear space is defined by the following quantities. 
Let Φ(X_t) = (Φ(x_1), …, Φ(x_t))⊤ ∈ R^{m×t}; then\n\nΣ_t = (Φ(X_t)⊤Φ(X_t) + ρ^2 I) and ν_t = (Σ_t)^{−1} Φ(X_t)⊤ y, (4)\n\nwhere ρ denotes the additive Gaussian noise incurred on the observations y of the true black-box function g(x). The approximated posterior mean then becomes μ̃_t(x) = Φ(x)⊤ν_t and the posterior variance σ̃_t(x)^2 = ρ^2 Φ(x)⊤(Σ_t)^{−1}Φ(x), when ‖Φ(x)‖_2^2 = 1 (which is true for RFF and QFF).\n\n3.2 Quadrature Fourier Features (QFF)\n\nThe literature on Fourier Features concentrates mostly on Random Fourier Features that use a Monte Carlo approximation of the integral. In this work, we take the perspective of numerical integration to approximate the integral, and review the basics of numerical quadrature here. Subsequently, we use Hermite-Gauss quadrature (a standard technique in numerical integration) to provide a uniform approximation over D for the squared exponential kernel - Quadrature Fourier Features (QFF) - with an exponentially decreasing error on the uniform approximation.\n\nNumerical Quadrature A quadrature scheme for an integral on a real interval is defined by two sets of points - weights and nodes. Nodes ({ω_j}_{j=1}^m) are points in the domain of the integrand at which the function is evaluated. Weights ({v_j}_{j=1}^m) are the scaling parameters that scale the evaluations at the nodes. In addition, the integral is usually formulated with a weight function w(x) that absorbs badly behaved properties of the integrand. For further details we refer the reader to the standard literature on numerical analysis [22]. An extension to multiple dimensions can be done by so-called Cartesian product grids (Def 2). 
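Read as a recipe, (4) replaces a t×t kernel solve with an m×m one. A minimal sketch of this feature-space update (our own naming; `Phi_X` is assumed to be the t×m feature design matrix, and the check against the function-space GP formulas is ours):

```python
import numpy as np

def feature_posterior(Phi_X, y, rho):
    """Feature-space posterior quantities in the spirit of (4):
    Sigma = Phi(X)^T Phi(X) + rho^2 I (m x m), nu = Sigma^{-1} Phi(X)^T y."""
    m = Phi_X.shape[1]
    Sigma = Phi_X.T @ Phi_X + rho ** 2 * np.eye(m)  # size m x m, independent of t
    nu = np.linalg.solve(Sigma, Phi_X.T @ y)
    return nu, Sigma

def posterior_moments(phi_x, nu, Sigma, rho):
    """Approximate posterior mean and variance at a single feature vector phi_x."""
    mean = phi_x @ nu
    var = rho ** 2 * phi_x @ np.linalg.solve(Sigma, phi_x)
    return mean, var

rng = np.random.default_rng(1)
Phi_X = rng.normal(size=(10, 3))   # t = 10 data points, m = 3 features
y = rng.normal(size=10)
rho = 0.5
nu, Sigma = feature_posterior(Phi_X, y, rho)
mean, var = posterior_moments(rng.normal(size=3), nu, Sigma, rho)
```

By the matrix inversion lemma, these moments coincide exactly with the usual function-space GP formulas for the linear kernel k(x, y) = Φ(x)⊤Φ(y); only the size of the linear solve changes from t×t to m×m.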
Cartesian product grids grow exponentially with the number of dimensions; however, for small dimensions they are very effective.\n\nDefinition 2 (Cartesian product grid). Let D = [a, b]^d, and let B be the set of nodes of a quadrature scheme for [a, b]. Then the Cartesian product grid is B^d = B × B × ⋯ × B, where × denotes the Cartesian product.\n\nAssumption 1 (Decomposability). Let k be a stationary kernel defined on R^d, s.t. k(x, y) ≤ 1 for all x, y ∈ R^d, with a Fourier transform that decomposes product-wise, p(ω) = ∏_{j=1}^d p_j(ω_j).\n\nQFF In order to define QFF we need Assumption 1. This assumption is natural, and is satisfied for common kernels such as the squared exponential (even ARD after a change of variables) or the modified Matérn kernel. Further details can be found in the supplementary material.\n\nDefinition 3 (QFF). Under Assumption 1, let m = (2m̄)^d, where m̄ ∈ N. Suppose that x, y ∈ [0, 1]^d. Let p(ω) = exp(−Σ_{j=1}^d ω_j^2 γ_j^2 / 2) be the Fourier transform of the kernel k. Then, we define the mapping\n\nΦ(x)_j = √(∏_{i=1}^d (1/γ_i) v(ω_{j,i})) cos(ω_j⊤x) if j ≤ m, and Φ(x)_j = √(∏_{i=1}^d (1/γ_i) v(ω_{j−m,i})) sin(ω_{j−m}⊤x) if 2m ≥ j > m, (5)\n\nwhere v(ω_{j,i}) = (2^{m̄−1} m̄! √π) / (m̄^2 H_{m̄−1}(ω_{j,i})^2) and H_i is the i-th Hermite polynomial. The set {ω_j}_{j=1}^m is formed by the Cartesian product of {(√2/γ_i) ω̄_i}_{i=1}^{m̄}, where each ω̄_i ∈ R is the i-th zero of the m̄-th Hermite polynomial. See Gauss-Hermite quadrature in [22].\n\nThe general scaling of m with the dimension d is exponential due to the use of Cartesian grids; however, our application area - BO - usually involves either small dimensional problems of up to 5 dimensions, or high dimensional BO with low effective dimensions - generalized additive models - where these methods are very effective.\n\nAdditive kernels When using generalized additive kernels k(x, y) = Σ_{j=1}^G k(x(j), y(j)), we can use QFF to approximate each single component independently with a mapping Φ(j)(x(j))⊤Φ(j)(y(j)), with m_j features, and stack them together into one vector Ξ. In this way, the number of features needed scales exponentially only with the effective dimension d̄, which is usually small even if d is large.\n\nApproximation Error We provide an upper bound on the error of the uniform approximation guarantee that decreases exponentially with m.\n\nTheorem 1 (QFF error). Let Φ(x) ∈ R^m with m = (2m̄)^d be as in Definition 3, with inputs in D = [0, 1]^d and γ = min_i γ_i. Then\n\nsup_{x,y∈D} |k(x, y) − Φ(x)⊤Φ(y)| ≤ d 2^{d−1} (m̄! √π / (2^{m̄} (2m̄)!)) (√2/γ)^{2m̄} ≤ d 2^{d−1} √(π/2) (1/m̄)^{m̄} (e/(4γ^2))^{m̄}. (6)\n\nTheorem 1 implies that if γ is very small, the decrease might be exponential only for m > γ^{−2}. This is confirmed by our numerical experiment in Figure 2c, and the break point m* = γ^{−2} at the intersection of the two purple lines predicts the start of the exponential decrease. The error on the posterior mean with this approximation can be seen in Figures 2a and 2b. 
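To make the quadrature construction concrete in one dimension, here is our sketch of a Gauss-Hermite feature map for the squared exponential kernel (function names are ours, and we use the normalized spectral density, so the per-node weights are the Gauss-Hermite weights divided by √π while the nodes are rescaled by √2/γ; this is one way to realize the idea of Definition 3, not a verbatim transcription of it):

```python
import numpy as np

def qff_features(X, m_bar, gamma):
    """Quadrature Fourier features for the 1-d squared exponential kernel
    k(x, y) = exp(-(x - y)^2 / (2 * gamma^2)) on [0, 1]."""
    # Gauss-Hermite nodes (zeros of H_{m_bar}) and weights for the weight e^{-u^2}
    nodes, weights = np.polynomial.hermite.hermgauss(m_bar)
    omega = np.sqrt(2.0) / gamma * nodes   # rescaled frequencies
    v = weights / np.sqrt(np.pi)           # weights for the normalized spectral density
    Z = np.outer(X, omega)
    # Phi(x) . Phi(y) = sum_i v_i cos(omega_i (x - y)) ~ k(x, y)
    return np.hstack([np.sqrt(v) * np.cos(Z), np.sqrt(v) * np.sin(Z)])

gamma = 0.5
X = np.linspace(0.0, 1.0, 50)
K_exact = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * gamma ** 2))
for m_bar in (5, 10, 20):
    Phi = qff_features(X, m_bar, gamma)
    err = np.max(np.abs(Phi @ Phi.T - K_exact))
    # err shrinks roughly exponentially as m_bar grows
```

Doubling m̄ drives the worst-case kernel error down at a rate consistent with Theorem 1, which is easy to verify numerically with the loop above.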
The exponential decrease of the posterior mean error with QFF follows from Theorem 5 in the supplementary material.\n\nFurthermore, for general additive kernels, the error bound in Theorem 1 depends only on the effective dimension d̄, although the dimension d might be much larger. The fact that the additive assumption improves the error convergence can be seen in Figure 2d, where different models with different effective dimensionalities are presented. However, for all models, the dimensionality d = 3 stays constant. The approximation has desirable properties even if the variables overlap, as the Circular example shows. The only requirement for efficiency is the low effective dimensionality.\n\n4 Efficient Algorithm for High Dimensional Bayesian Optimization\n\nThompson Sampling Using Thompson sampling (TS) with the Fourier Features approximation, we are able to devise an analytic form for the acquisition function. Namely, a sample path from the approximating GP amounts to sampling a fixed dimensional vector θ_t ∼ N(ν_t, (Σ_t)^{−1}), where the quantities come from (4). The rule for Thompson sampling with a generalized additive kernel becomes\n\nx_{t+1} = arg max_{x∈D} Ξ(x)⊤θ_t = arg max_{x∈D} Σ_{j=1}^G Φ(j)(x(j))⊤θ_t(j). (7)\n\nSince θ_t has a fixed dimension m, the cost to compute the posterior and the sample path is constant O(m^3), in contrast to O(t^3) and O(|D|^3) for the canonical TS. In addition, this formulation allows the use of first-order optimizers to optimize the acquisition function effectively.\n\nFigure 2: Panels: (a) T = 1, γ = 0.2; (b) T = 16, γ = 0.2; (c) T = 4, m = 128; (d) Additive k, γ = 0.7. The plots show the error on the uniform approximation of the posterior mean estimate. The black-box function g is a sample from a GP with squared exponential kernel. 
For 2a and 2b d = 2, for 2c d = 1, and for 2d d = 3 (but some models are additive). The tilde denotes the quantities approximated with Fourier Features. The parameter T represents the number of data points. In 2d, Circular corresponds to overlapping groups {(x1, x2), (x2, x3), (x3, x1)} and Additive to two non-overlapping groups {(x1, x2), (x3)}.\n\nThe acquisition function for each variable group j is Lipschitz continuous with the constant L_t(j) = ‖θ_t(j)‖ √(Σ_{i=1}^{m_j/2} 2 v_i^2 ‖ω_i‖_2^2), thus we can provably run a global optimization algorithm to optimize the acquisition function presented in (7). Furthermore, optimization to a finer accuracy does not require re-sampling or iterative posterior updates, and can be done adaptively with first-order optimizers or global optimization methods such as DIRECT [23] due to the availability of the analytic expression once θ_t has been sampled.\n\nWith the assumption of additivity (without overlapping groups), the optimization problem in (7) decomposes over the variable groups. Hence, one can perform block coordinate optimization independently. For global optimization algorithms, we are able to provide a polynomial bound on the number of evaluations of the acquisition function for a fixed horizon T of queries to the black-box function.\n\nTheorem 2 (Polynomial Algorithm). Let δ ∈ (0, 1), let T ∈ N be a fixed horizon for BO, and let k be an additive squared exponential kernel with G groups and d̄ the maximal dimension among the additive components. Moreover, let Φ(j)(·) ∈ R^{m_j} be the approximation of the j-th additive component as in Definition 3 with m_j ≥ 2(log_η(T^3))^{d_j} and m_j ≥ 1/γ_j^2, where η = 16/e. Then a Lipschitz global optimization algorithm [33] requires at most\n\nO( (G log(T/δ)^{d̄/2} / α^{d̄}) (T^{3/2}(log T)^{d̄} + T^2(log T)^{2d̄}) ) (8)\n\nevaluations of the acquisition function (7) to reach accuracy α for each optimization subproblem with probability 1 − δ.\n\nIn addition, when the kernel is fully additive (d̄ = 1), the number of evaluations is at most O( (d √(log(T/δ)) / α) T^2(log T)^2 ). In practice, thanks to the analytic formulation, one can perform gradient ascent to optimize the function with effectively constant work per iteration.\n\nThe polynomial algorithm is stated in full in Algorithm 1 with an arbitrary Lipschitz global optimization oracle. Note that by design, first the correlated θ_t is sampled, and only then is the acquisition function decomposed and optimized in parallel. This ensures that we include the cross correlation of the additive groups, and yet decompose the acquisition function, which has been an open problem for Add-GP-UCB type algorithms [39].\n\nAlgorithm 1 Thompson sampling with Fourier Features and additive models\nRequire: Fourier Feature mapping Φ(j)(x) ∈ R^{m_j} for each j ∈ [G], accuracy α_t\nEnsure: Domain D = [0, 1]^d, m_j > 1/γ_j^2\nfor t = 1, …, T do\n  Update ν_t and Σ_t according to (4). ▷ Calculate posterior\n  Sample θ_t ∼ N(ν_t, (Σ_t)^{−1}) ▷ Sampling via Cholesky decomp.\n  for j = 1, …, G do ▷ Iterate over the variable groups\n    Find x_t(j) = arg max_{x∈D} (θ_t(j))⊤Φ(j)(x(j)) ▷ global optimization\n  end for\n  Query the function, i.e. y_t = g(x_t) + ε_t.\nend for\n\nOther Acquisition Functions Apart from Thompson sampling, one can apply QFF to significantly improve sampling-based acquisition functions such as Entropy Search (ES) [17], Predictive Entropy Search (PES) [20] and Max-Value Entropy Search (MES) [54]. We focus on TS exclusively as the evaluation of its acquisition function is computationally more efficient. In the former methods, one needs to create statistics describing the maximizer or maximum of the Gaussian process via sampling.\n\n5 Regret Bounds\n\nA theoretical measure for BO algorithms is the cumulative regret R_T = Σ_{t=1}^T g(x*) − g(x_t), which represents the cost associated with not knowing the optimum of the black-box function, g(x*), a priori. Usually one desires algorithms that are no-regret, meaning that R_T/T → 0 as T → ∞. In this work, we focus on algorithms with a fixed horizon T, and where observations of g(x) are corrupted with Gaussian noise ε ∼ N(0, ρ^2). We provide bounds on the cumulative regret for Thompson sampling and GP-UCB (in the appendix).\n\nIn the supplementary material, we provide a general regret analysis assuming an arbitrary ε(m)-uniformly approximating kernel. This allows us to identify conditions on the dependence of ε(m) such that an algorithm can be no-regret. RFF, in contrast to QFF, do not achieve sublinear cumulative regret with our analysis. For the exponentially decreasing error of QFF, we can prove that asymptotically our bound on the cumulative regret coincides (up to logarithmic factors) with the bound for canonical Thompson sampling in the following theorem (a similar result holds for GP-UCB). Our proof technique relies on the ideas introduced in [9].\n\nTheorem 3. 
Let δ ∈ (0, 1), let k be an additive squared exponential kernel with G components, and let the black-box function be bounded in RKHS norm. Then Thompson sampling run with the approximated kernel using QFF from Definition 3, s.t. m_j ≥ 2(log_η(T^3))^{d_j} and m_j > γ_j^{−2} for each j ∈ [G], where each acquisition function is optimized to accuracy α_t = 1/√t, suffers a cumulative regret bounded by\n\nR_T ≤ O( G (log T)^{d̄+1} √T log(T/δ)^{3/2} ) (9)\n\nwith probability 1 − δ, where d̄ is the dimension of the largest additive component.\n\nTheorem 3 implies that the size of the Fourier basis for QFF needs to scale as m = O(log T^3)^{d_j} to have a no-regret algorithm. Hence the kernel inversion for d = 1 in (4) needs only O(T(log T^3)^2) basic arithmetic operations, which can significantly speed up the posterior calculation for low dimensional or low effective dimensional kernels, since for these we have m = Σ_{j=1}^G m_j.\n\n6 Experimental Evaluation\n\nBenchmark functions We present cumulative regret plots for standard benchmarks with the squared exponential kernel (Figure 3). We test Thompson sampling with QFF for a fixed horizon with high-dimensional functions used previously in [14]. Details of the experiments are in the supplementary material.\n\nFigure 3: Panels: (a) d = d̄ = 2; (b) d = 20, d̄ = 1; (c) d = 10, d̄ = 1; (d) d = 5, d̄ = 2; (e) d = 5, d̄ = 1; (f) Runtime comparison. In these graphs we compare exact Thompson sampling (TS-exact), the RFF approximation (TS-RFF) and the QFF approximation (TS-QFF). We plot the cumulative regret divided by the horizon (iteration) T, similar to the statements in our theorems. The prefix Advs indicates that we started with a set of observations larger than the Fourier basis, located in a selected (negative) part of the domain, in the spirit of the example in Figure 1. 
For\nevery experiment the full dimension d and the dimension of the largest additive component (cid:22)d is speci\ufb01ed. The\nfunctional forms can be found in the supplementary material.\n\nsupplementary material. We compare QFF, RFF and the exact GP. In Figure 3f, we show that for\neach experiment the speed of computation improves signi\ufb01cantly even though for high dimensional\nexperiments the grid for Lipschitz optimization was twice as \ufb01ne as for the exact method. In some\ninstances the QFF performs better than the BO with exact GP. We hypothesize that in these cases\nQFF serves as a regularizer and simpli\ufb01es the BO problem; or in the case of high dimensional func-\ntions, we were able to optimize the function with a \ufb01ner grid than the non-exact method. In addition,\nRFF perform well on experiments without adversarial initialization, which suggests that on average\nthis approximation can seem to work, but there are adversarial cases like in Figure 3c, where RFF\nfail.\n\nTuning free electron laser\nIn Figure 3e we present an experiment on real-world objective. This\nexperiment presents preliminary results on automatic tuning of hyperparameters for a large free\nelectron laser SwissFEL located at PSI in Switzerland. We run our algorithm on a simulator that\nwas \ufb01t with the data collected from the machine. In the \ufb01tting, we used the additive assumption.\nThe simulator reliably models the considerable noise level in the measurements. This experiment\nis an unusual example of BO as measurements can be obtained very quickly at frequencies up to\n1 Hz. However, the results are representative for only a couple of hours due to drift in the system.\nTherefore, the desire for a method which has an acquisition function that can be ef\ufb01ciently optimized\nin high dimensions is paramount. The cost of the optimization with our method is \ufb01xed and does not\nvary with the number of data points. 
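This fixed cost can be illustrated with a minimal one-dimensional sketch (not the authors' implementation): a Gauss-Hermite quadrature feature map stands in for QFF on the squared exponential kernel, and maximization over a grid stands in for the Lipschitz optimizer. Once the $2m$-dimensional weight posterior is formed, drawing and optimizing a Thompson sample touches only the $m$-dimensional weights, not the $T$ data points; the names and defaults below are illustrative.

```python
import numpy as np

def qff_features(x, m=20, lengthscale=1.0):
    """Quadrature Fourier features for the 1-D squared exponential kernel.

    Gauss-Hermite nodes/weights discretize the Gaussian spectral density,
    so phi(x) @ phi(y) ~= exp(-(x - y)^2 / (2 * lengthscale^2)).
    """
    nodes, weights = np.polynomial.hermite.hermgauss(m)
    omega = np.sqrt(2.0) * nodes / lengthscale       # quadrature frequencies
    scale = np.sqrt(weights / np.sqrt(np.pi))        # per-feature amplitudes
    x = np.atleast_1d(x)[:, None]
    return np.hstack([scale * np.cos(x * omega), scale * np.sin(x * omega)])

def thompson_step(X, y, grid, m=20, lengthscale=1.0, rho=0.1, seed=0):
    """One Thompson sampling step in feature space.

    The weight posterior (prior N(0, I), noise N(0, rho^2)) costs
    O(T m^2 + m^3) to form; sampling a weight vector via Cholesky and
    maximizing the sampled function is independent of T.
    """
    rng = np.random.default_rng(seed)
    Phi = qff_features(X, m, lengthscale)             # T x 2m design matrix
    A = Phi.T @ Phi + rho**2 * np.eye(Phi.shape[1])   # 2m x 2m system
    mean = np.linalg.solve(A, Phi.T @ y)              # posterior mean weights
    cov = rho**2 * np.linalg.inv(A)                   # posterior covariance
    theta = mean + np.linalg.cholesky(cov) @ rng.standard_normal(mean.size)
    sample = qff_features(grid, m, lengthscale) @ theta  # GP sample on grid
    return grid[np.argmax(sample)]                    # next query point
```

Under the additive assumption, one such low-dimensional sampler would be run per variable group, which is what keeps the acquisition optimization tractable in high dimensions.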
Due to the very noisy evaluations, the number of queries needed to reach the optimum is considerable. Our method is the only method which fulfills these criteria and has provable guarantees. We show that the runtime of our algorithm is an order of magnitude lower than that of the canonical algorithm, and that it reaches better solutions, as we can afford to optimize the acquisition function to higher accuracy.

7 Conclusion

We presented an algorithm for high dimensional BO with generalized additive kernels based on Thompson sampling. We show that the algorithm is no-regret and needs only a polynomial number of evaluations of the acquisition function with a fixed horizon. In addition, we introduced a novel deterministic Fourier Features based approximation of the squared exponential kernel for this algorithm. This approximation is well suited for generalized additive models with a low effective dimension. The approximation error decreases exponentially with the size of the basis for the squared exponential kernel.

Acknowledgements

This research was supported by SNSF grant 407540_167212 through the NRP 75 Big Data program. The authors would like to thank Johannes Kirschner for valuable discussions. In addition, we thank the SwissFEL team for provision of the preliminary data from the free electron laser. In particular, we thank Nicole Hiller, Franziska Frei and Rasmus Ischebeck of the Paul Scherrer Institute, Switzerland.

References

[1] Yasin Abbasi-Yadkori and Csaba Szepesvari. Online learning for linearly parametrized control problems. PhD thesis, University of Alberta, 2012.

[2] Sivaram Ambikasaran, Daniel Foreman-Mackey, Leslie Greengard, David W. Hogg, and Michael O'Neil. Fast direct methods for Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):252-265, 2016.

[3] Haim Avron, Vikas Sindhwani, Jiyan Yang, and Michael Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels.
The Journal of Machine Learning Research, 17(1):4096-4133, 2016.

[4] Matej Balog, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M Roy, and Yee Whye Teh. The Mondrian kernel. In Uncertainty in Artificial Intelligence, pages 32-41, 2016.

[5] John P Boyd. Exponentially convergent Fourier-Chebyshev quadrature schemes on bounded and infinite intervals. Journal of Scientific Computing, 2(2):99-109, 1987.

[6] John P Boyd. Chebyshev and Fourier spectral methods. Courier Corporation, 2001.

[7] Eric Brochu, Vlad M Cora, and Nando De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.

[8] Bo Chen, Rui Castro, and Andreas Krause. Joint optimization and variable selection of high-dimensional Gaussian processes. 2012.

[9] Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In International Conference on Machine Learning, 2017.

[10] Tri Dao, Christopher M De Sa, and Christopher Ré. Gaussian quadrature for kernel features. In Advances in Neural Information Processing Systems, pages 6109-6119, 2017.

[11] Kai Diethelm. Error bounds for the numerical integration of functions with limited smoothness. SIAM Journal on Numerical Analysis, 52(2):877-879, 2014.

[12] Josip Djolonga, Andreas Krause, and Volkan Cevher. High-dimensional Gaussian process bandits. In Advances in Neural Information Processing Systems, pages 1025-1033, 2013.

[13] David K Duvenaud, Hannes Nickisch, and Carl E Rasmussen. Additive Gaussian processes. In Advances in Neural Information Processing Systems, pages 226-234, 2011.

[14] Jacob Gardner, Chuan Guo, Kilian Weinberger, Roman Garnett, and Roger Grosse. Discovering and exploiting additive structure for Bayesian optimization.
In Artificial Intelligence and Statistics, pages 1311-1319, 2017.

[15] Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized Markov decision processes. In Conference on Learning Theory, pages 861-898, 2015.

[16] Trevor J Hastie and Robert J Tibshirani. Generalized additive models, volume 43 of Monographs on Statistics and Applied Probability, 1990.

[17] Philipp Hennig and Christian J Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13(Jun):1809-1837, 2012.

[18] James Hensman, Nicolas Durrande, and Arno Solin. Variational Fourier features for Gaussian processes. Journal of Machine Learning Research, 18:1-52, 2018.

[19] James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. Uncertainty in Artificial Intelligence, 2013.

[20] José Miguel Hernández-Lobato, Matthew W Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pages 918-926, 2014.

[21] José Miguel Hernández-Lobato, James Requeima, Edward O. Pyzer-Knapp, and Alán Aspuru-Guzik. Parallel and distributed Thompson sampling for large-scale accelerated exploration of chemical space. International Conference on Machine Learning, 2017.

[22] Francis Begnaud Hildebrand. Introduction to numerical analysis. Courier Corporation, 1987.

[23] Donald R Jones. DIRECT global optimization algorithm. In Encyclopedia of Optimization, pages 431-440. Springer, 2001.

[24] Kirthevasan Kandasamy, Jeff Schneider, and Barnabás Póczos. High dimensional Bayesian optimisation and bandits via additive models.
In International Conference on Machine Learning, pages 295-304, 2015.

[25] Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast Bayesian optimization of machine learning hyperparameters on large datasets. International Conference on Artificial Intelligence and Statistics, 2017.

[26] Andreas Krause and Cheng S Ong. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447-2455, 2011.

[27] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302-1338, 2000.

[28] Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen, and Aníbal Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11(Jun):1865-1881, 2010.

[29] Daniel J Lizotte, Tao Wang, Michael H Bowling, and Dale Schuurmans. Automatic gait optimization with Gaussian process regression. In IJCAI, volume 7, pages 944-949, 2007.

[30] Mitchell McIntire, Daniel Ratner, and Stefano Ermon. Sparse Gaussian processes for Bayesian optimization. In Uncertainty in Artificial Intelligence, 2016.

[31] Brian McWilliams, David Balduzzi, and Joachim M Buhmann. Correlated random features for fast semi-supervised learning. In Advances in Neural Information Processing Systems, pages 440-448, 2013.

[32] Mojmír Mutný and Peter Richtárik. Parallel stochastic Newton method. Journal of Computational Mathematics, 36(3):405-426, 2018.

[33] Yurii Nesterov. Introduction to convex optimization: A basic course. Springer, 2004.

[34] Geoff Pleiss, Jacob R. Gardner, Kilian Q. Weinberger, and Andrew Gordon Wilson. Constant-time predictive distributions for Gaussian processes. International Conference on Machine Learning, 2018.

[35] Ali Rahimi and Benjamin Recht.
Uniform approximation of functions with random bases. In Communication, Control, and Computing, 2008 46th Annual Allerton Conference on, pages 555-561. IEEE, 2008.

[36] Ali Rahimi, Benjamin Recht, et al. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, volume 3, page 5, 2007.

[37] Carl Rasmussen and Chris Williams. Gaussian processes for machine learning. The MIT Press, Cambridge, 2006.

[38] Pradeep Ravikumar, Han Liu, John Lafferty, and Larry Wasserman. SpAM: Sparse additive models. In Advances in Neural Information Processing Systems, pages 1201-1208. Curran Associates Inc., 2007.

[39] Paul Rolland, Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. High-dimensional Bayesian optimization via additive models with overlapping groups. International Conference on Artificial Intelligence and Statistics, 84, 2018.

[40] Walter Rudin. Principles of Mathematical Analysis (International Series in Pure & Applied Mathematics). McGraw-Hill Publishing Co., 1976.

[41] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221-1243, 2014.

[42] Bernhard Schölkopf, Ralf Herbrich, and Alex Smola. A generalized representer theorem. In Computational Learning Theory, pages 416-426. Springer, 2001.

[43] Steven L Scott. Multi-armed bandit experiments in the online service economy. Applied Stochastic Models in Business and Industry, 31(1):37-45, 2015.

[44] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257-1264, 2006.

[45] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md Mostofa Ali Patwary, Prabhat, and Ryan P Adams.
Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, pages 2171-2180, 2015.

[46] Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter. Bayesian optimization with robust Bayesian neural networks. In Advances in Neural Information Processing Systems, pages 4134-4142, 2016.

[47] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. International Conference on Machine Learning, 2010.

[48] Bharath Sriperumbudur and Zoltán Szabó. Optimal rates for random Fourier features. In Advances in Neural Information Processing Systems, pages 1144-1152, 2015.

[49] Josef Stoer and Roland Bulirsch. Introduction to numerical analysis, 2nd printing. Springer-Verlag, Berlin and New York, 1983.

[50] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285-294, 1933.

[51] Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, pages 567-574, 2009.

[52] Andrea Vedaldi and Andrew Zisserman. Efficient additive kernels via explicit feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):480-492, 2012.

[53] Zi Wang, Clement Gehring, Pushmeet Kohli, and Stefanie Jegelka. Ensemble Bayesian optimization. International Conference on Artificial Intelligence and Statistics, 2017.

[54] Zi Wang, Clement Gehring, Pushmeet Kohli, and Stefanie Jegelka. Batched large-scale Bayesian optimization in high-dimensional spaces. International Conference on Artificial Intelligence and Statistics, 2018.

[55] Zi Wang and Stefanie Jegelka.
Max-value entropy search for efficient Bayesian optimization. International Conference on Machine Learning, 2017.

[56] Ziyu Wang, Frank Hutter, Masrour Zoghi, David Matheson, and Nando de Freitas. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55:361-387, 2016.

[57] Christopher KI Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682-688, 2001.

[58] Andrew Gordon Wilson, Christoph Dann, and Hannes Nickisch. Thoughts on massively scalable Gaussian processes. arXiv preprint arXiv:1511.01870, 2015.

[59] Andrew Gordon Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pages 1775-1784, 2015.