{"title": "Bayesian Optimization under Heavy-tailed Payoffs", "book": "Advances in Neural Information Processing Systems", "page_first": 13790, "page_last": 13801, "abstract": "We consider black box optimization of an unknown function in the nonparametric Gaussian process setting when the noise in the observed function values can be heavy tailed. This is in contrast to existing literature that typically assumes sub-Gaussian noise distributions for queries. Under the assumption that the unknown function belongs to the Reproducing Kernel Hilbert Space (RKHS) induced by a kernel, we first show that an adaptation of the well-known GP-UCB algorithm with reward truncation enjoys sublinear $\\tilde{O}(T^{\\frac{2 + \\alpha}{2(1+\\alpha)}})$ regret even with only the $(1+\\alpha)$-th moments, $\\alpha \\in (0,1]$, of the reward distribution being bounded ($\\tilde{O}$ hides logarithmic factors). However, for the common squared exponential (SE) and Mat\\'{e}rn kernels, this is seen to be significantly larger than a fundamental $\\Omega(T^{\\frac{1}{1+\\alpha}})$ lower bound on regret. We resolve this gap by developing novel Bayesian optimization algorithms, based on kernel approximation techniques, with regret bounds matching the lower bound in order for the SE kernel. We numerically benchmark the algorithms on environments based on both synthetic models and real-world data sets.", "full_text": "Bayesian Optimization under Heavy-tailed Payoffs\n\nSayak Ray Chowdhury\n\nDepartment of ECE\n\nIndian Institute of Science\nBangalore, India 560012\nsayak@iisc.ac.in\n\nAditya Gopalan\nDepartment of ECE\n\nIndian Institute of Science\nBangalore, India 560012\naditya@iisc.ac.in\n\nAbstract\n\n(cid:16)\n\nWe consider black box optimization of an unknown function in the nonparametric\nGaussian process setting when the noise in the observed function values can be\nheavy tailed. 
This is in contrast to existing literature that typically assumes sub-Gaussian noise distributions for queries. Under the assumption that the unknown function belongs to the Reproducing Kernel Hilbert Space (RKHS) induced by a kernel, we first show that an adaptation of the well-known GP-UCB algorithm with reward truncation enjoys sublinear Õ(T^{(2+α)/(2(1+α))}) regret even with only the (1+α)-th moments, α ∈ (0, 1], of the reward distribution being bounded (Õ hides logarithmic factors). However, for the common squared exponential (SE) and Matérn kernels, this is seen to be significantly larger than a fundamental Ω(T^{1/(1+α)}) lower bound on regret. We resolve this gap by developing novel Bayesian optimization algorithms, based on kernel approximation techniques, with regret bounds matching the lower bound in order for the SE kernel. We numerically benchmark the algorithms on environments based on both synthetic models and real-world data sets.

1 Introduction

Black-box optimization of an unknown function f : R^d → R with expensive, noisy queries is a generic problem arising in domains such as hyper-parameter tuning for complex machine learning models [3], sensor selection [14], synthetic gene design [15], experimental design, etc. The popular Bayesian optimization (BO) approach to solving this problem starts with a prior distribution, typically a nonparametric Gaussian process (GP), over a function class, uses function evaluations to compute the posterior distribution over functions, and chooses the next function evaluation adaptively – using a sampling strategy – towards reaching the optimum. 
Popular sampling strategies include expected improvement [25], probability of improvement [40], upper confidence bounds [35], Thompson sampling [11], predictive-entropy search [17], etc.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The design and analysis of adaptive sampling strategies for BO typically involves the assumption of bounded, or at worst sub-Gaussian, distributions for the rewards (or losses) observed by the learner, which is quite light-tailed. Yet, many real-world environments are known to exhibit heavy-tailed behavior: the distribution of delays in data networks is inherently heavy-tailed, especially with highly variable or bursty traffic flow distributions that are well-modeled with heavy tails [20]; heavy-tailed price fluctuations are common in finance and insurance data [29]; and properties of complex networks, such as the degree distribution, often exhibit heavy tails [37]. This motivates studying methods for Bayesian optimization when observations are significantly heavy tailed compared to Gaussian.

A simple version of black-box optimization – in the form of online learning in finite multi-armed bandits (MABs) – with heavy-tailed payoffs was first studied rigorously by Bubeck et al. [8], where the payoffs are assumed to have bounded (1+α)-th moment for α ∈ (0, 1]. They showed that for MABs with only finite variances (i.e., α = 1), by using statistical estimators that are more robust than the empirical mean, one can still recover the optimal regret rate for MAB under the sub-Gaussian assumption. Moving further, Medina and Yang [24] consider these estimators for the problem of linear (parametric) stochastic bandits under heavy-tailed rewards, and Shao et al. [34] show that almost optimal algorithms can be designed by using an optimistic, data-adaptive truncation of rewards. 
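The truncation idea just mentioned is easy to illustrate. The following sketch is our own illustration, not code from the paper or from Bubeck et al. [8] (who also analyze median-of-means and Catoni-type estimators); it compares the plain empirical mean with a truncated mean on centered heavy-tailed samples:

```python
import numpy as np

rng = np.random.default_rng(1)

def truncated_mean(rewards, b):
    """Truncated empirical mean: zero out samples whose magnitude exceeds
    the truncation level b, then average (the truncation idea of [8])."""
    r = np.asarray(rewards, dtype=float)
    return float(np.mean(np.where(np.abs(r) <= b, r, 0.0)))

# Centered Pareto(3)-type rewards: finite variance, heavy right tail, true mean 0.
raw = rng.pareto(3.0, size=100_000) - 0.5  # E[Pareto(3)] = 1/(3-1) = 0.5
est_plain = float(np.mean(raw))
est_trunc = truncated_mean(raw, b=20.0)
```

Truncation introduces a small bias (mass above the level b is discarded) in exchange for insensitivity to rare extreme samples, which is the trade-off the regret analyses below must balance.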
Some other important works include pure exploration under heavy-tailed noise [43], payoffs with bounded kurtosis [23], extreme bandits [10], and heavy-tailed payoffs with α ∈ (0, ∞) [38].

Against this backdrop, we consider regret minimization with heavy-tailed reward distributions in bandits with a potentially continuous arm set, whose (unknown) expected reward function is nonparametric and assumed to have smoothness compatible with a kernel on the arm set. Here, it is unclear if existing BO techniques relying on statistical confidence sets based on sub-Gaussian observations can be made to attain nontrivial regret, since it is unlikely that these confidence sets will be correct at all. It is worth mentioning that in the finite-dimensional setting, Shao et al. [34] solve the problem almost optimally, but their results do not carry over to the general nonparametric kernelized setup, since their algorithms and regret bounds depend crucially on the finite feature dimension. We answer the question affirmatively in this work, formalizing and solving BO under heavy-tailed noise almost optimally. Specifically, this paper makes the following contributions.

• We adapt the GP-UCB algorithm to heavy-tailed payoffs by a truncation step, and show that it enjoys a regret bound of Õ(γ_T T^{(2+α)/(2(1+α))}), where γ_T depends on the kernel associated with the RKHS and is generally sub-linear in T. This regret rate, however, is potentially sub-optimal due to an Ω(T^{1/(1+α)}) fundamental lower bound on regret that we show for two specific kernels, namely the squared exponential (SE) kernel and the Matérn kernel.

• We develop a new Bayesian optimization algorithm by truncating rewards in each direction of an approximate, finite-dimensional feature space. 
We show that the feature approximation can be carried out by two popular kernel approximation techniques: Quadrature Fourier features [26] and Nyström approximation [9]. The new algorithm under either approximation scheme gets regret Õ(γ_T T^{1/(1+α)}), which is optimal up to log factors for the SE kernel.

• Finally, we report numerical results based on experiments on synthetic as well as real-world-based datasets, for which the algorithms we develop are seen to perform favorably in the harsher heavy-tailed environments.

Related work. An alternative line of work uses approaches for black-box optimization based on Lipschitz-type smoothness structure [22, 7, 2, 33], which is qualitatively different from RKHS-smoothness-type assumptions. Recently, Bogunovic et al. [5] consider GP optimization under an adversarial perturbation of the query points, but the observation noise there is assumed to be Gaussian, unlike our heavy-tailed environments. Kernel approximation schemes in the context of BO usually focus on reducing the cubic cost of Gram matrix inversion [39, 41, 26, 9]. However, we crucially use these approximations to achieve optimal regret for BO under heavy-tailed noise, which, we believe, might not be possible without resorting to the kernel approximations.

2 Problem formulation

Let f : X → R be a fixed but unknown function over a domain X ⊂ R^d for some d ∈ N. At every round, a learner queries f at a single point x_t ∈ X and observes a noisy payoff y_t = f(x_t) + η_t. Here the noise sequence η_t, t ≥ 1, is assumed to consist of zero-mean i.i.d. random variables such that the payoffs satisfy E[|y_t|^{1+α} | F_{t−1}] ≤ v for some α ∈ (0, 1] and v ∈ (0, ∞), where F_{t−1} = σ({(x_τ, y_τ)}_{τ=1}^{t−1}, x_t) denotes the σ-algebra generated by the events so far¹. The query point x_t at round t is chosen causally, depending upon the history {(x_s, y_s)}_{s=1}^{t−1} of query and payoff sequences available up to round t − 1. 
Observe that this bound on the (1+α)-th moment at best yields bounded variance for y_t, and does not necessarily mean that y_t (or η_t) is sub-Gaussian, as is typically assumed. The learner's goal is to maximize its (expected) cumulative reward Σ_{t=1}^T f(x_t) over a time horizon T, or, equivalently, to minimize its cumulative regret R_T = Σ_{t=1}^T (f(x*) − f(x_t)), where x* ∈ argmax_{x∈X} f(x) is a maximum point of f (assuming the maximum is attained; not necessarily unique). A sublinear growth of R_T with T implies that the time-average regret R_T/T → 0 as T → ∞.

¹If instead the moment bound holds for each η_t, then this can be translated to a moment bound for each y_t using, say, a bound on f(x).

Regularity assumptions: Attaining sub-linear regret is impossible in general for arbitrary reward functions f, and thus some regularity assumptions are needed. In this paper, we assume smoothness for f induced by the structure of a kernel on X. Specifically, we make the standard assumption of a p.s.d. kernel k : X × X → R such that k(x, x) ≤ 1 for all x ∈ X, and f being an element of the reproducing kernel Hilbert space (RKHS) H_k(X) of smooth real-valued functions on X. Moreover, the RKHS norm of f is assumed to be bounded, i.e., ‖f‖_H ≤ B for some B < ∞. 
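As a concrete instance of the moment condition (our illustration; the paper's experiments use Student's-t and Pareto payoffs, per Figure 1), Student's-t noise with 5 degrees of freedom has finite variance, so the condition holds with α = 1, yet the distribution is not sub-Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)

def heavy_tailed_payoff(f_x, df=5.0, size=1):
    """Payoff y_t = f(x_t) + eta_t with Student's-t noise: finite variance
    (so alpha = 1 in the moment condition) but heavier tails than Gaussian."""
    return f_x + rng.standard_t(df, size=size)

samples = heavy_tailed_payoff(f_x=0.5, size=200_000)
# For df = 5, Var(eta) = 5/3, so E[|y_t|^2] = 5/3 + 0.25 ~ 1.92 gives a valid v.
second_moment = float(np.mean(samples ** 2))
```

Smaller degrees of freedom (e.g., df = 3, as in the experiments) keep the variance finite while making the tails heavier still.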
Boundedness of k along the diagonal holds for any stationary kernel, i.e., where k(x, x′) = k(x − x′), e.g., the Squared Exponential kernel k_SE and the Matérn kernel k_Matérn:

k_SE(x, x′) = exp(−r²/(2l²))  and  k_Matérn(x, x′) = (2^{1−ν}/Γ(ν)) (√(2ν) r/l)^ν B_ν(√(2ν) r/l),

where l > 0 and ν > 0 are hyperparameters of the kernels, r = ‖x − x′‖₂ is the distance between x and x′, and B_ν is the modified Bessel function.

3 Warm-up: the first algorithm

Towards designing a BO algorithm for heavy-tailed observations, we briefly recall the standard GP-UCB algorithm for the sub-Gaussian setting. GP-UCB at time t chooses the point x_t = argmax_{x∈X} μ_{t−1}(x) + β_t σ_{t−1}(x), where μ_t(x) = k_t(x)^T (K_t + λI_t)^{−1} Y_t and σ_t²(x) = k(x, x) − k_t(x)^T (K_t + λI_t)^{−1} k_t(x) are the posterior mean and variance functions after t observations from a function drawn from the GP prior GP_X(0, k), with additive i.i.d. Gaussian noise N(0, λ). Here Y_t = [y_1, ..., y_t]^T is the vector formed by the observations, K_t = [k(u, v)]_{u,v∈X_t} is the kernel matrix computed at the set X_t = {x_1, ..., x_t}, k_t(x) = [k(x_1, x), ..., k(x_t, x)]^T, and I_t is the identity matrix of order t. If the noise η_t is assumed conditionally R-sub-Gaussian, i.e., E[e^{γη_t} | F_{t−1}] ≤ exp(γ²R²/2) for all γ ∈ R, then using β_{t+1} = O(R √(ln|I_t + λ^{−1}K_t|)) ensures Õ(√T) regret [11], as the posterior GP concentrates rapidly on the true function f. However, when the sub-Gaussian assumption does not hold, we cannot expect the posterior GP to have such a nice concentration property. 
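The two kernels above admit simple implementations; the following sketch (ours, with the Matérn restricted to the half-integer values of ν for which the Bessel form has a closed form) makes the definitions concrete:

```python
import numpy as np

def k_se(x, y, l=1.0):
    """Squared exponential kernel: exp(-r^2 / (2 l^2)); note k(x, x) = 1."""
    r = np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float(np.exp(-r**2 / (2.0 * l**2)))

def k_matern(x, y, l=1.0, nu=2.5):
    """Matern kernel for half-integer nu, where the modified-Bessel form
    reduces to a polynomial times an exponential in z = sqrt(2 nu) r / l."""
    r = np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    z = np.sqrt(2.0 * nu) * r / l
    if nu == 0.5:
        return float(np.exp(-z))
    if nu == 1.5:
        return float((1.0 + z) * np.exp(-z))
    if nu == 2.5:
        return float((1.0 + z + z**2 / 3.0) * np.exp(-z))
    raise NotImplementedError("general nu requires the Bessel function B_nu")
```

For ν = 0.5 the Matérn kernel reduces to the exponential kernel e^{−r/l}, and as ν → ∞ it approaches k_SE.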
In fact, it is known that the ridge regression estimator μ_t ∈ H_k(X) of f is not robust when the noise exhibits heavy fluctuations [19]. So, in order to tackle heavy-tailed noise, one needs more robust estimates μ̂_t of f along with suitable confidence sets. A natural idea to curb the effects of heavy fluctuations is to truncate high rewards [8]. Our first algorithm, Truncated GP-UCB (Algorithm 1), is based on this idea.

Truncated GP-UCB (TGP-UCB) algorithm: At each time t, we truncate the reward y_t to zero if it is larger than a suitably chosen truncation level b_t, i.e., we set the truncated reward ŷ_t = y_t 1{|y_t| ≤ b_t}. Then, we construct the truncated version of the posterior mean as μ̂_t(x) = k_t(x)^T (K_t + λI_t)^{−1} Ŷ_t, where Ŷ_t = [ŷ_1, ..., ŷ_t]^T, and simply run GP-UCB with μ̂_t instead of μ_t. The truncation level b_t can be adapted with time t. We choose an increasing sequence of b_t's, i.e., as time progresses and the confidence interval shrinks, we truncate less and less aggressively.

Algorithm 1 Truncated GP-UCB (TGP-UCB)
Input: Parameters λ > 0, {b_t}_{t≥1}, {β_t}_{t≥1}
Set μ̂_0(x) = 0 and σ_0²(x) = k(x, x) for all x ∈ X
for t = 1, 2, 3, ... do
Play x_t = argmax_{x∈X} μ̂_{t−1}(x) + β_t σ_{t−1}(x) and observe payoff y_t
Set ŷ_t = y_t 1{|y_t| ≤ b_t} and Ŷ_t = [ŷ_1, ..., ŷ_t]^T
Compute μ̂_t(x) = k_t(x)^T (K_t + λI_t)^{−1} Ŷ_t and σ_t²(x) = k(x, x) − k_t(x)^T (K_t + λI_t)^{−1} k_t(x)
end for

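One round of TGP-UCB can be sketched as follows (our illustration, using the SE kernel and a finite candidate set; the paper treats general X):

```python
import numpy as np

def se_kernel_matrix(X1, X2, l=1.0):
    """SE kernel matrix between row-stacked point sets X1 (n x d), X2 (m x d)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * l**2))

def tgp_ucb_scores(X_hist, y, X_cand, b_t, beta, lam=1.0, l=1.0):
    """One TGP-UCB round (sketch): truncate all rewards at level b_t, then
    form the truncated posterior mean and the usual posterior variance."""
    y_trunc = np.where(np.abs(y) <= b_t, y, 0.0)     # hat{y} = y * 1{|y| <= b_t}
    K = se_kernel_matrix(X_hist, X_hist, l)
    A = np.linalg.inv(K + lam * np.eye(len(X_hist)))  # (K_t + lam I_t)^{-1}
    kx = se_kernel_matrix(X_cand, X_hist, l)          # k_t(x) per candidate row
    mu = kx @ A @ y_trunc                             # truncated posterior mean
    var = 1.0 - np.einsum('ij,jk,ik->i', kx, A, kx)   # k(x, x) = 1 on the diagonal
    return mu + beta * np.sqrt(np.maximum(var, 0.0))  # UCB score per candidate
```

Here an outlying reward of 10 is zeroed out by the truncation, so the score at a far-away, unexplored candidate is driven by its uncertainty bonus rather than by the outlier.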
Finally, in order to account for the bias introduced by truncation, we blow up the confidence width β_t of GP-UCB by a multiplicative factor of b_t, so that f(x) is contained in the interval μ̂_{t−1}(x) ± β_t σ_{t−1}(x) with high probability. This helps us obtain a sub-linear regret bound for TGP-UCB, given in Theorem 1, with a full proof deferred to Appendix B.

Theorem 1 (Regret bound for TGP-UCB) Let f ∈ H_k(X), ‖f‖_H ≤ B and k(x, x) ≤ 1 for all x ∈ X. Let E[|y_t|^{1+α} | F_{t−1}] ≤ v < ∞ for some α ∈ (0, 1] and for all t ≥ 1. Then, for any δ ∈ (0, 1], TGP-UCB, with b_t = v^{1/(1+α)} t^{1/(2(1+α))} and β_{t+1} = B + (3/√λ) b_t √(ln|I_t + λ^{−1}K_t| + 2 ln(1/δ)), enjoys, with probability at least 1 − δ, the regret bound

R_T = O(B √(T γ_T) + v^{1/(1+α)} √(γ_T (γ_T + ln(1/δ))) T^{(2+α)/(2(1+α))}),

where γ_T ≡ γ_T(k, X) = max_{A⊂X:|A|=T} (1/2) ln|I_T + λ^{−1}K_A|.

Here, γ_T denotes the maximum information gain about any f ∼ GP_X(0, k) after T noisy observations obtained by passing f through an i.i.d. Gaussian channel N(0, λ), and it measures the reduction in the uncertainty of f after T noisy observations. It is a property of the kernel k and domain X; e.g., if X is compact and convex, then γ_T = O((ln T)^{d+1}) for k_SE and O(T^{d(d+1)/(2ν+d(d+1))} ln T) for k_Matérn [35].

Remark 1. 
An R-sub-Gaussian environment satisfies the moment condition with α = 1 and v = R², so the result implies a sub-linear Õ(T^{3/4}) regret bound for TGP-UCB in sub-Gaussian environments.

4 Regret lower bound

Establishing lower bounds under general kernel smoothness structure is an open problem even when the payoffs are Gaussian. Similar to Scarlett et al. [31], we focus only on the SE and Matérn kernels.

Theorem 2 (Lower bound on cumulative regret) Let X = [0, 1]^d for some d ∈ N. Fix a kernel k ∈ {k_SE, k_Matérn}, B > 0, T ∈ N, α ∈ (0, 1] and v > 0. Given any algorithm, there exists a function f ∈ H_k(X) with ‖f‖_H ≤ B, and a reward distribution satisfying E[|y_t|^{1+α} | F_{t−1}] ≤ v for all t ∈ [T] := {1, 2, ..., T}, such that when the algorithm is run with this f and reward distribution, its regret satisfies

1. E[R_T] = Ω(v^{1/(1+α)} (ln(v^{−1/α} B^{(1+α)/α} T))^{dα/(1+α)} T^{1/(1+α)}) if k = k_SE,
2. E[R_T] = Ω(v^{ν/(ν(1+α)+dα)} B^{dα/(ν(1+α)+dα)} T^{(ν+dα)/(ν(1+α)+dα)}) if k = k_Matérn.

The proof argument is inspired by that of Scarlett et al. [31], which provides the lower bound for BO under i.i.d. Gaussian noise, but with nontrivial changes to account for heavy-tailed observations. The proof is based on constructing a finite subset of “difficult” functions in H_k(X). Specifically, we choose f as a uniformly sampled function from a finite set {f_1, ..., f_M}, where each f_j is obtained by shifting a common function g ∈ H_k(R^d) by a different amount such that each of these has a unique maximum, and then cropping to X = [0, 1]^d. g takes values in [−2Δ, 2Δ] with the maximum attained at x = 0. 
The function g is constructed properly, and the parameters Δ, M are chosen appropriately based on the kernel k and the fixed constants B, T, α, v, such that any Δ-optimal point for f_j fails to be a Δ-optimal point for any other f_{j′}, and such that ‖f_j‖_H ≤ B for all j ∈ [M]. The reward distribution takes values in {sgn(f(x)) (v/(2Δ))^{1/α}, 0}, with the former occurring with probability (2Δ/v)^{1/α} |f(x)|, such that, for every x ∈ X, the expected reward is f(x) and the (1+α)-th raw moment is upper bounded by v. Now, if we can lower bound the regret averaged over j ∈ [M], then there must exist some f_j for which the bound holds. The formal proof is deferred to Appendix C.

Remark 2. Theorem 2 suggests that (a) TGP-UCB may be suboptimal, and (b) for the SE kernel, it may be possible to design algorithms recovering the Õ(√T) regret bound under finite variances (α = 1).

5 An optimal algorithm under heavy tailed rewards

In view of the gap between the regret bound for TGP-UCB and the fundamental lower bound, it is possible that TGP-UCB (Algorithm 1) does not completely mitigate the effect of heavy-tailed fluctuations, and perhaps truncation in a different domain may work better. In fact, for parametric linear bandits (i.e., BO with finite-dimensional linear kernels), it has been shown that appropriate truncation in feature space improves regret performance as opposed to truncating raw observations [34], and in this case the feature dimension explicitly appears in the regret bound. However, the main challenge in the more general nonparametric setting is that the feature space is infinite dimensional,
which would yield a trivial regret upper bound. If we can find an approximate feature map φ̃ : X → R^m in a low-dimensional Euclidean inner product space R^m such that k(x, y) ≈ φ̃(x)^T φ̃(y), then we can perform the above feature-adaptive truncation effectively, as well as keep the error introduced by the approximation under control. Such a kernel approximation can be done efficiently either in a data-independent way (Fourier features approximation [28]) or in a data-dependent way (Nyström approximation [12]), and it has been used in the context of BO to reduce the time complexity of GP-UCB [26, 9]. In this work, however, the approximations are crucial for obtaining optimal theoretical guarantees. We now describe our algorithm, Adaptively Truncated Approximate GP-UCB (Algorithm 2).

5.1 Adaptively Truncated Approximate GP-UCB (ATA-GP-UCB) algorithm

At each round t, we select an arm x_t which maximizes the approximate (under kernel approximation) GP-UCB score μ̃_{t−1}(x) + β_t σ̃_{t−1}(x), where μ̃_{t−1}(x) and σ̃²_{t−1}(x) denote the approximate posterior mean and variance from the previous round, respectively, and β_t is an appropriately chosen confidence width. Then, we update μ̃_t(x) and σ̃²_t(x) as follows. First, we find a feature embedding φ̃_t ∈ R^{m_t}, of some appropriate dimension m_t, which approximates the kernel efficiently. Then, we find the rows u_1^T, ..., u_{m_t}^T of the matrix Ṽ_t^{−1/2} Φ̃_t^T, where Φ̃_t = [φ̃_t(x_1), ..., φ̃_t(x_t)]^T and Ṽ_t = Φ̃_t^T Φ̃_t + λI_{m_t}, and we use those as the weight vectors for truncating the rewards in each of the m_t directions by setting r̂_i = Σ_{τ=1}^t u_{i,τ} y_τ 1{|u_{i,τ} y_τ| ≤ b_t} for all i ∈ [m_t], where b_t specifies the truncation level. Then, we find our estimate of f as θ̃_t = Ṽ_t^{−1/2} [r̂_1, ..., r̂_{m_t}]^T. Finally, we approximate the posterior mean as μ̃_t(x) = φ̃_t(x)^T θ̃_t and the posterior variance as (i) σ̃²_t(x) = λ φ̃_t(x)^T Ṽ_t^{−1} φ̃_t(x) for the Fourier features approximation, or as (ii) σ̃²_t(x) = k(x, x) − φ̃_t(x)^T φ̃_t(x) + λ φ̃_t(x)^T Ṽ_t^{−1} φ̃_t(x) for the Nyström approximation. Now it only remains to describe how to find the feature embeddings φ̃_t.

(a) Quadrature Fourier features (QFF) approximation: If k is a bounded, continuous, positive definite, stationary kernel satisfying k(x, x) = 1, then by Bochner's theorem [4], k is the Fourier transform of a probability measure p, i.e., k(x, y) = ∫_{R^d} p(ω) cos(ω^T(x − y)) dω (abusing notation for measure and density). For the SE kernel, this measure has density p(ω) = (l/√(2π))^d e^{−l²‖ω‖²₂/2}. Mutny and Krause [26] show that for any stationary kernel k on R^d whose inverse Fourier transform decomposes product-wise, i.e., p(ω) = ∏_{j=1}^d p_j(ω_j), we can use Gauss-Hermite quadrature [18] to approximate it. If X = [0, 1]^d, the SE kernel is approximated as follows. Choose m̄ ∈ N and m = m̄^d, and construct the 2m-dimensional feature map

φ̃(x)_i = √(ν(ω_i)) cos((√2/l) ω_i^T x) if 1 ≤ i ≤ m,  and  φ̃(x)_i = √(ν(ω_{i−m})) sin((√2/l) ω_{i−m}^T x) if m + 1 ≤ i ≤ 2m.   (1)

Here the set {ω_1, ..., ω_m} = A_m̄ × ··· × A_m̄ (d times), where A_m̄ is the set of m̄ (real) roots of the m̄-th Hermite polynomial H_m̄, and ν(z) = ∏_{j=1}^d (2^{m̄−1} m̄!)/(m̄² H_{m̄−1}(z_j)²) for all z ∈ R^d. 
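For d = 1, the embedding (1) can be sketched with Gauss-Hermite nodes and weights as follows (our illustration; the normalization here absorbs the quadrature weights so that φ̃(x)^T φ̃(y) approximates k_SE(x, y)):

```python
import numpy as np

def qff_features_1d(x, m_bar=8, l=1.0):
    """Quadrature Fourier features for the 1-d SE kernel (sketch of (1)).
    Gauss-Hermite nodes z_i and weights w_i give a deterministic
    2*m_bar-dimensional map with phi(x).phi(y) ~= exp(-(x-y)^2 / (2 l^2))."""
    z, w = np.polynomial.hermite.hermgauss(m_bar)  # roots of H_{m_bar}, weights
    scale = np.sqrt(w / np.sqrt(np.pi))            # normalized quadrature weights
    arg = np.sqrt(2.0) * np.outer(np.atleast_1d(np.asarray(x, dtype=float)), z) / l
    return np.hstack([scale * np.cos(arg), scale * np.sin(arg)])

# Quadrature accuracy check on [0, 1]: the error decays exponentially in m_bar.
x, y = 0.2, 0.9
approx = (qff_features_1d(x) @ qff_features_1d(y).T).item()
exact = float(np.exp(-(x - y) ** 2 / 2.0))
```

The cosine/sine pair per node is what turns the quadrature approximation of E_ω[cos(ω(x − y))] into an inner product of features of x and y separately.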
For our purposes, we will have ATA-GP-UCB work with the embedding φ̃_t(x) = φ̃(x) of dimension m_t = 2m for all t ≥ 1.

Remark 3. The seminal work of Rahimi and Recht [28], which develops the random Fourier feature (RFF) approximation of any stationary kernel, is based on the feature map φ̃(x) = (1/√m) [cos(ω_1^T x), ..., cos(ω_m^T x), sin(ω_1^T x), ..., sin(ω_m^T x)]^T, where each ω_i is sampled independently from p(ω). However, RFF embeddings do not appear to be useful for our purpose of achieving sublinear regret (see the discussion after Lemma 1), so we work with the QFF embedding.

(b) Nyström approximation: Unlike the QFF approximation, where the basis functions (cosine and sine) do not depend on the data, the basis functions used by the Nyström method are data dependent. For a set of points X_t = {x_1, ..., x_t}, the Nyström method [42] approximates the kernel matrix K_t as follows. First, sample a random number m_t of points from X_t to construct a dictionary D_t = {x_{i_1}, ..., x_{i_{m_t}}}, i_j ∈ [t], according to the following distribution: for each i ∈ [t], include x_i in D_t independently with probability p_{t,i} = min{q σ̃²_{t−1}(x_i), 1}, for a suitably chosen parameter q (which trades off between the quality and the size of the embedding). Then, compute the (approximate) finite-dimensional feature embedding φ̃_t(x) = (K_{D_t}^{1/2})† k_{D_t}(x), where K_{D_t} = [k(u, v)]_{u,v∈D_t}, k_{D_t}(x) = [k(x_{i_1}, x), ..., k(x_{i_{m_t}}, x)]^T, and A† denotes the pseudo-inverse of any matrix A. 
We call the entire procedure NyströmEmbedding (pseudocode in the appendix).

Algorithm 2 Adaptively Truncated Approximate GP-UCB (ATA-GP-UCB)
Input: Parameters λ > 0, {b_t}_{t≥1}, {β_t}_{t≥1}, q, a kernel approximation (QFF or Nyström)
Set: μ̃_0(x) = 0 and σ̃²_0(x) = k(x, x) for all x ∈ X
for t = 1, 2, 3, ... do
Play x_t = argmax_{x∈X} μ̃_{t−1}(x) + β_t σ̃_{t−1}(x) and observe payoff y_t
Set φ̃_t(x) = φ̃(x) if QFF approximation, or φ̃_t = NyströmEmbedding({(x_i, σ̃_{t−1}(x_i))}_{i=1}^t, q) if Nyström approximation
Set Φ̃_t^T = [φ̃_t(x_1), ..., φ̃_t(x_t)] and Ṽ_t = Φ̃_t^T Φ̃_t + λI_{m_t}, where m_t is the dimension of φ̃_t
Find the rows u_1^T, ..., u_{m_t}^T of Ṽ_t^{−1/2} Φ̃_t^T and set r̂_i = Σ_{τ=1}^t u_{i,τ} y_τ 1{|u_{i,τ} y_τ| ≤ b_t} for all i ∈ [m_t]
Set θ̃_t = Ṽ_t^{−1/2} [r̂_1, ..., r̂_{m_t}]^T and compute μ̃_t(x) = φ̃_t(x)^T θ̃_t
Set σ̃²_t(x) = (i) λ φ̃_t(x)^T Ṽ_t^{−1} φ̃_t(x) if QFF approximation, or (ii) k(x, x) − φ̃_t(x)^T φ̃_t(x) + λ φ̃_t(x)^T Ṽ_t^{−1} φ̃_t(x) if Nyström approximation
end for

Remark 4. It is well known (λ-ridge leverage score sampling [1]) that, by sampling points proportionally to their posterior variances σ²_t(x), one can obtain an accurate embedding φ̃_t(x), which in turn gives an accurate approximation σ̃²_t(x). But computation of σ²_t(x) in turn requires inverting K_t, which takes at most O(t³) time. 
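The core of NyströmEmbedding, the map φ̃_t(x) = (K_{D_t}^{1/2})† k_{D_t}(x), can be sketched as follows (our illustration; the variance-proportional dictionary sampling step is omitted):

```python
import numpy as np

def nystrom_features(K_dict, k_x):
    """Nystrom embedding (sketch): phi(x) = (K_D^{1/2})^+ k_D(x), where K_dict
    is the kernel matrix on the dictionary D and the columns of k_x stack the
    vectors k_D(x). Then phi(x)^T phi(y) ~= k(x, y), exactly for points in D."""
    vals, vecs = np.linalg.eigh(K_dict)  # K_dict is symmetric p.s.d.
    inv_sqrt = np.where(vals > 1e-10, 1.0 / np.sqrt(np.clip(vals, 1e-30, None)), 0.0)
    sqrt_pinv = (vecs * inv_sqrt) @ vecs.T  # (K_dict^{1/2})^+ via eigendecomposition
    return sqrt_pinv @ k_x

# Exactness on the dictionary: for the dictionary points themselves, k_D(x) are
# the columns of K_dict, so phi^T phi reproduces K_dict exactly.
a = np.exp(-0.5)  # SE kernel value between the two dictionary points 0 and 1
K = np.array([[1.0, a], [a, 1.0]])
Phi = nystrom_features(K, K)  # columns: embeddings of the dictionary points
```

The pseudo-inverse (rather than a plain inverse) keeps the map well defined when the dictionary kernel matrix is rank deficient.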
So, we make use of the already computed approximations σ̃²_{t−1}(x) to sample points at round t, without significantly compromising the accuracy of the embeddings [9].

Remark 5. The choice (i) of σ̃²_t(x) in Algorithm 2 ensures accurate estimation of the variance of x under the QFF approximation [26]. But the same choice leads to severe underestimation of the variance under the Nyström approximation, especially when x is far away from D_t. The choice (ii) of σ̃²_t(x) in Algorithm 2 is known as the deterministic training conditional in the GP literature [27], and it provably prevents the phenomenon of variance starvation under the Nyström approximation [9].

5.2 Cumulative regret of ATA-GP-UCB with QFF embeddings

The following lemma shows that the data-adaptive truncation of all the historical rewards, together with a good approximation of the kernel, helps us obtain a tighter confidence interval than TGP-UCB.

Lemma 1 (Tighter confidence sets with QFF truncation) For any δ ∈ (0, 1], ATA-GP-UCB with QFF approximation and parameters b_t = (v/ln(2mT/δ))^{1/(1+α)} t^{(1−α)/(2(1+α))} and β_{t+1} = B + 4√(m/λ) v^{1/(1+α)} (ln(2mT/δ))^{α/(1+α)} t^{(1−α)/(2(1+α))} ensures that, with probability at least 1 − δ, uniformly over all t ∈ [T] and x ∈ X,

|f(x) − μ̃_{t−1}(x)| ≤ β_t σ̃_{t−1}(x) + O(B ε_m^{1/2} t²),   (2)

where the QFF dimension m is such that sup_{x,y∈X} |k(x, y) − φ̃(x)^T φ̃(y)| =: ε_m < 1.

Here, the scaling t^{(1−α)/(2(1+α))} of the confidence width β_t is much smaller than the scaling t^{1/(2(1+α))} of TGP-UCB, which eventually leads to a tighter confidence 
interval. However, in order to achieve sublinear cumulative regret, we need to ensure that the approximation error ε_m decays at least as fast as O(1/T⁶) and that the feature dimension m grows no faster than polylog(T). This will ensure that the regret accumulated due to the second term on the RHS of (2) is O(1), and that the contribution from the first term is Õ(T^{1/(1+α)}), since the sum of the approximate posterior standard deviations grows only as Õ(√(mT)). Now, the QFF embedding (1) of k_SE can be shown to achieve ε_m = O(d 2^{d−1} (e/(4 m̄ l²))^{m̄}) [26]. The decay is exponential when m̄ > 1/l² and d = O(1)². Now, setting m̄ = Θ(log_{4/e}(T⁶)), we can ensure that ε_m^{1/2} T³ = O(1) and m = O((ln T)^d), and thus, in turn, a sublinear regret bound³. The following theorem states this formally, with a full proof deferred to Appendix D.2.

²For most BO applications, the effective dimensionality of the problem is low, e.g., additive models [21, 30].
³Under the RFF approximation, ε_m = Õ(√(1/m)) [36]; hence, ATA-GP-UCB with RFF embeddings does not achieve sublinear regret.

Theorem 3 (Regret bound for ATA-GP-UCB with QFF embedding) Fix any δ ∈ (0, 1]. Then, under the same hypothesis of Theorem 1, for X = [0, 1]^d and k = k_SE, ATA-GP-UCB under QFF approximation, with parameters b_t and β_t set as in Lemma 1, and with the embedding φ̃ from (1) such that m̄ > 1/l² and m̄ = Θ(log_{4/e}(T⁶)), enjoys, with probability at least 1 − δ, the regret bound

R_T = O(B √(T (ln T)^{d+1}) + v^{1/(1+α)} (ln(T (ln T)^d / δ))^{α/(1+α)} √(ln T) (ln T)^d T^{1/(1+α)}).

Remark 6. 
When the variance of the rewards is finite (i.e., α = 1), the cumulative regret of ATA-GP-UCB under QFF approximation of the SE kernel is O((ln T)^{d+1} √T), which recovers the state-of-the-art regret bound of GP-UCB under sub-Gaussian rewards [26, Corollary 2], unlike the earlier TGP-UCB. It is worth pointing out that the bound in Theorem 3 holds only for the SE kernel defined on X = [0, 1]^d, and designing a no-regret BO strategy under the QFF approximation of any other stationary kernel remains an open question even when the rewards are sub-Gaussian [26].

5.3 Cumulative regret of ATA-GP-UCB with Nyström embeddings

We now show that ATA-GP-UCB under Nyström approximation achieves optimal regret for any stationary kernel defined on X ⊂ R^d, without any restriction on d. Similar to Lemma 1, ATA-GP-UCB under Nyström approximation also maintains tighter confidence sets than TGP-UCB. As before, the confidence sets are useful only if the dimension m_t of the embeddings grows no faster than polylog(t). Not only that, we also need to ensure that the approximate posterior variances are only a constant factor away from the exact ones. Then, since the sum of the posterior standard deviations grows only as O(√(T γ_T)), we can achieve the optimal Õ(T^{1/(1+α)}) regret scaling. Now, for any ε ∈ (0, 1), setting q = 6((1+ε)/(1−ε)) ln(2T/δ)/ε², the Nyström embeddings φ̃_t can be shown to achieve m_t ≤ 6((1+ε)/(1−ε))(1 + 1/λ) q γ_t and ((1−ε)/(1+ε)) σ²_t(x) ≤ σ̃²_t(x) ≤ ((1+ε)/(1−ε)) σ²_t(x) with probability at least 1 − δ [9], which helps us achieve an optimal regret bound. 
The following theorem states this formally, with a full proof deferred to Appendix D.3.

Theorem 4 (Regret bound for ATA-GP-UCB with Nyström embedding). Fix any $\delta \in (0,1]$, $\varepsilon \in (0,1)$ and set $\rho = \frac{1+\varepsilon}{1-\varepsilon}$. Then, under the same hypothesis of Theorem 1, ATA-GP-UCB under the Nyström approximation, with parameters $q = 6\rho\ln(4T/\delta)/\varepsilon^2$, $b_t = \left(v/\ln(4m_tT/\delta)\right)^{\frac{1}{1+\alpha}} t^{\frac{1-\alpha}{2(1+\alpha)}}$ and $\beta_{t+1} = B\left(1+\frac{1}{\sqrt{1-\varepsilon}}\right) + 4\sqrt{m_t/\lambda}\, v^{\frac{1}{1+\alpha}}\left(\ln(4m_tT/\delta)\right)^{\frac{\alpha}{1+\alpha}} t^{\frac{1-\alpha}{2(1+\alpha)}}$, enjoys, with probability at least $1-\delta$, the regret bound
$$R_T = O\left(\rho B\left(1+\frac{1}{\sqrt{1-\varepsilon}}\right)\sqrt{T\gamma_T} + \frac{\rho^2}{\varepsilon}\, v^{\frac{1}{1+\alpha}}\left(\ln\frac{\gamma_T\ln(T/\delta)T}{\delta}\right)^{\frac{\alpha}{1+\alpha}}\sqrt{\ln(T/\delta)}\,\gamma_T\, T^{\frac{1}{1+\alpha}}\right).$$

Remark 7. Theorems 3 and 4 imply that ATA-GP-UCB achieves $\tilde{O}\left(v^{\frac{1}{1+\alpha}}(\ln T)^d T^{\frac{1}{1+\alpha}}\right)$ regret for $k_{SE}$, which matches the lower bound (Theorem 2) up to a factor of $\frac{\alpha}{1+\alpha}$ in the exponent of $\ln T$, as well as a few extra $\ln T$ factors hidden in the notation $\tilde{O}$. For the Matérn kernel, the bound is $\tilde{O}\left(T^{\frac{2\nu+(2+\alpha)d(d+1)}{(1+\alpha)(2\nu+d(d+1))}}\right)$, which is sublinear only when $\frac{d(d+1)}{2\nu} < \alpha$,⁴ and the gap from the lower bound is more significant in this case. It is worth mentioning that a similar gap is present even in the (easier) setting of sub-Gaussian rewards [31], and there might exist better algorithms that bridge this gap. When the variance of the rewards is finite (i.e., $\alpha = 1$), the cumulative regret of ATA-GP-UCB under the Nyström approximation is $\tilde{O}(\gamma_T\sqrt{T})$, which recovers the state-of-the-art regret bound under sub-Gaussian rewards [9, Thm. 2]. For the linear bandit setting, i.e., when the feature map $\tilde{\varphi}_t(x) = x$ itself, substituting $\gamma_T = O(d\ln T)$, we find that the regret upper bound in Theorem 4 recovers the (optimal) regret bound of [34, Thm. 3] up to a logarithmic factor.

⁴This holds, for example, for the Matérn kernel on $\mathbb{R}^2$ with $\nu = 3.5$ when the variance of the rewards is finite ($\alpha = 1$).

5.4 Computational complexity of ATA-GP-UCB

Figure 1: (a)–(e) Time-average regret ($R_T/T$) for TGP-UCB, ATA-GP-UCB with QFF approximation (ATA-GP-UCB-QFF) and Nyström approximation (ATA-GP-UCB-Nyström) on heavy-tailed data: (a) $k_{SE}$, $f \in$ RKHS, Student's-t; (b) $k_{SE}$, $f \in$ RKHS, Pareto; (c) $k_{\text{Matérn}}$, $f \in$ RKHS, Student's-t; (d) stock market data; (e) light sensor data. (f) Confidence sets ($\mu_t \pm \sigma_t$) formed by GP-UCB with and without truncation under heavy fluctuations.

(a) Time complexity: Under the (data-dependent) Nyström approximation, constructing the dictionary $D_t$ takes $O(t)$ time at each step $t$. Then, we compute the embeddings $\tilde{\varphi}_t(x)$ for all arms in $O(m_t^2 t + m_t|\mathcal{X}|)$ time, where $|\mathcal{X}|$ is the cardinality of $\mathcal{X}$. Construction of $\tilde{V}_t$ takes $O(m_t^2 t)$ time, since we need to rebuild it from scratch, and $\tilde{V}_t^{-1/2}$ is computed in $O(m_t^3)$ time. We can then compute $\tilde{\mu}_t(x)$ and $\tilde{\sigma}_t^2(x)$ for all arms in $O(m_t^2 t + m_t|\mathcal{X}|)$ and $O(m_t^2|\mathcal{X}|)$ time, respectively, using the already computed $\tilde{\varphi}_t(x)$ and $\tilde{V}_t^{-1/2}$. Thus the per-step time complexity is $O(m_t^3 + m_t^2(t+|\mathcal{X}|))$, since $m_t \le t$.
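As a rough illustration of where these costs come from, the following sketch computes an approximate posterior mean and variance from an $m$-dimensional feature map via feature-space ridge regression (a simplified stand-in for $\tilde{\mu}_t$ and $\tilde{\sigma}_t^2$; the function and variable names are ours): one $O(m^3)$ factorization per step, then $O(m^2)$ work per candidate arm.

```python
import numpy as np

def approx_posterior(Phi_hist, y_hist, Phi_arms, lam=1.0):
    """Posterior mean/variance from m-dimensional features (ridge form).

    Phi_hist: (t, m) features of arms played so far; y_hist: (t,) rewards;
    Phi_arms: (n, m) features of the candidate arms."""
    t, m = Phi_hist.shape
    V = Phi_hist.T @ Phi_hist + lam * np.eye(m)        # O(m^2 t)
    L = np.linalg.cholesky(V)                          # O(m^3), once per step
    theta = np.linalg.solve(V, Phi_hist.T @ y_hist)    # ridge estimate
    mu = Phi_arms @ theta                              # O(m) per arm
    # var(x) = lam * phi(x)^T V^{-1} phi(x) = lam * ||L^{-1} phi(x)||^2
    half = np.linalg.solve(L, Phi_arms.T)              # O(m^2) per arm
    var = lam * (half ** 2).sum(axis=0)
    return mu, var

rng = np.random.default_rng(1)
Phi_hist = rng.normal(size=(200, 5))                   # 200 past plays, m = 5
theta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_hist = Phi_hist @ theta_true                         # noiseless, for clarity
Phi_arms = rng.normal(size=(10, 5))
mu, var = approx_posterior(Phi_hist, y_hist, Phi_arms)
```

Since only the $m \times m$ matrix $V$ and the feature vectors need to be kept, the same decomposition also explains the space complexities discussed next.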
For continuous $\mathcal{X}$, one can approximately maximize the GP-UCB score by grid search or branch-and-bound methods such as DIRECT [6]. In fact, it can be maximized to within $O(\varepsilon)$ accuracy by making $O(\varepsilon^{-d})$ calls to it, yielding a per-step time complexity of $O(m_t^2(t+\varepsilon^{-d}))$. Since $m_t = \tilde{O}(\gamma_t)$ and $\gamma_t$ is poly-logarithmic in $t$ for the SE kernel, the per-step time complexity is $\tilde{O}(t+\varepsilon^{-d})$. For the Matérn kernel, the complexity is $\tilde{O}(t^p(t+\varepsilon^{-d}))$, $1 < p < 2$. Similarly, under the (data-independent) QFF approximation, the per-step time complexity is $O(m^3 + m^2(t+\varepsilon^{-d})) = \tilde{O}(t+\varepsilon^{-d})$, since $m = O((\ln T)^d)$ for the SE kernel.
(b) Space complexity: Under the Nyström approximation, at each round $t$ we need to store all previously chosen arms, the matrix $\tilde{V}_t^{-1/2}$ and the vectors $\tilde{\varphi}_t(x)$. Hence, the per-step space complexity of ATA-GP-UCB is $O(t + m_t(m_t+\varepsilon^{-d})) = O(m_t(m_t+\varepsilon^{-d}))$ for small enough $\varepsilon$. Under the QFF approximation, the complexity is $O(m(m+\varepsilon^{-d}))$.

6 Experiments

We numerically compare the performance of TGP-UCB (Algorithm 1) and ATA-GP-UCB with QFF (ATA-GP-UCB-QFF) and Nyström (ATA-GP-UCB-Nyström) approximations (Algorithm 2) on both synthetic and real-world heavy-tailed environments. (Our code is available here.) The confidence width $\beta_t$ and truncation level $b_t$ of our algorithms, and the trade-off parameter $q$ used in the Nyström approximation, are set order-wise similar to those recommended by theory (Theorems 1, 3 and 4). We use $\lambda = 1$ in all algorithms and $\varepsilon = 0.1$ in ATA-GP-UCB-Nyström. We plot the mean and standard deviation (over independent trials) of the time-average regret $R_T/T$ in Figure 1. We use the following datasets.
1.
Synthetic data: We generate the objective function $f \in H_k(\mathcal{X})$ with $\mathcal{X}$ set to be a discretization of $[0,1]$ into 100 evenly spaced points. Each $f = \sum_{i=1}^{p} a_i k(\cdot, x_i)$ was generated using an SE kernel with $l = 0.2$, by uniformly sampling $a_i \in [-1,1]$ and support points $x_i \in \mathcal{X}$, with $p = 100$. We set $B = \max_{x\in\mathcal{X}}|f(x)|$. To generate the rewards, first we consider $y(x) = f(x) + \eta$, where the noise $\eta$ is sampled from the Student's t-distribution with 3 degrees of freedom (Figure 1 a). Here, the variance is bounded ($\alpha = 1$) and hence $v = B^2 + 3$. Next, we generate the rewards as samples from the Pareto distribution with shape parameter 2 and scale parameter $f(x)/2$; $f$ is generated similarly, except that here we sample the $a_i$'s uniformly from $[0,1]$. Then, we set $B$ as before, leading to the bound $v = \frac{B^{1+\alpha}}{2^{\alpha}(1-\alpha)}$ on the $(1+\alpha)$-th raw moments. We plot the results for $\alpha = 0.9$ (Figure 1 b). We use $m = 32$ features (consistent with Theorem 3) for ATA-GP-UCB-QFF in these experiments. Next, we generate $f$ using the Matérn kernel with $l = 0.2$ and $\nu = 2.5$, and consider the same Student's-t distribution as earlier to generate the rewards.
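A minimal sketch of this synthetic construction (the seed and helper names are ours; the paper's exact sampling details may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
l, p = 0.2, 100
X = np.linspace(0.0, 1.0, 100)           # discretized domain [0, 1]

def k_se(x, y):
    # squared exponential kernel with lengthscale l
    return np.exp(-(x - y) ** 2 / (2 * l ** 2))

# f = sum_i a_i k(., x_i) lies in the RKHS H_k(X)
support = rng.choice(X, size=p)
a = rng.uniform(-1.0, 1.0, size=p)
f = (a[None, :] * k_se(X[:, None], support[None, :])).sum(axis=1)
B = np.abs(f).max()

# Student's-t rewards (alpha = 1): y(x) = f(x) + eta, eta ~ t with 3 dof
x_idx = rng.integers(len(X))
y_student = f[x_idx] + rng.standard_t(df=3)

# Pareto rewards with shape 2 and scale f(x)/2 (here a_i >= 0 so f > 0);
# the mean of this Pareto is exactly f(x), so the bandit means are unchanged
a_pos = rng.uniform(0.0, 1.0, size=p)
f_pos = (a_pos[None, :] * k_se(X[:, None], support[None, :])).sum(axis=1)
y_pareto = (f_pos[x_idx] / 2.0) * (1.0 + rng.pareto(2.0))
```

Note that `numpy`'s `Generator.pareto` draws from the Lomax distribution, so the classical Pareto sample with scale $s$ is obtained as $s(1 + \text{pareto}(a))$.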
As we do not yet have a theory for ATA-GP-UCB-QFF with the Matérn kernel, we exclude it from this evaluation (Figure 1 c). We perform 20 trials of $2\times10^4$ rounds each, and for each trial we evaluate on a different $f$ (which explains the high error bars).
2. Stock market data: We consider a representative application of identifying the most profitable stock in a given pool of stocks. This is motivated by the practical scenario of an investor who would like to invest a fixed budget of money in a stock and get as much return as possible. We took the adjusted closing prices of 29 stocks from January 4th, 2016 to April 10th, 2019. A Kolmogorov–Smirnov (KS) test shows that the null hypothesis of stock prices following a Gaussian distribution is rejected in favor of a heavy-tailed distribution. We take the empirical mean of the stock prices as our objective function $f$ and the empirical covariance of the normalized stock prices as our kernel function $k$ (since stock behaviors are mostly correlated with one another). We consider $\alpha = 1$ and set $v$ as the empirical average of the squared prices. Since the kernel is data dependent, we cannot run ATA-GP-UCB-QFF here. We average over 10 independent trials of the algorithms (Figure 1 d).
3. Light sensor data: We use light sensor data collected in the CMU Intelligent Workplace in November 2005, containing the locations of 41 sensors, 601 train samples and 192 test samples, in the context of learning the maximum average reading of the sensors. For each sensor, the KS test on its readings rejects the Gaussian in favor of a heavy-tailed distribution. We take the empirical average of the test samples as our objective $f$ and the empirical covariance of the normalized train samples as our kernel $k$. We consider $\alpha = 1$, set $v$ as the empirical mean of the squared readings, and $B$ as the maximum of the average readings.
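The normality check used for both real-world datasets can be sketched as follows, using simulated heavy-tailed samples as a stand-in for the actual price and sensor readings (which are not reproduced here); we fit a Gaussian by moments and compare the one-sample KS statistic to its asymptotic 5% critical value:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
# stand-in data: heavy-tailed Student's-t samples shifted/scaled like prices
x = 100.0 + 5.0 * rng.standard_t(df=3, size=2000)

# fit a Gaussian by moments, then compute the one-sample KS statistic
mu, sigma = x.mean(), x.std(ddof=1)
xs = np.sort(x)
F = np.array([0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
              for z in (xs - mu) / sigma])        # fitted Gaussian CDF
n = len(xs)
D = max(np.max(np.arange(1, n + 1) / n - F),      # sup |ECDF - F|
        np.max(F - np.arange(0, n) / n))

# asymptotic 5% critical value of the KS statistic
reject_gaussian = D > 1.36 / math.sqrt(n)
```

In practice one would use a library routine (e.g. a standard KS-test implementation) on the real readings; the manual statistic above just makes the test's mechanics explicit.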
For ATA-GP-UCB-QFF, we fit an SE kernel with $l^2 = 0.1$ on the given sensor locations and approximate it with $m = 16^2 = 256$ features (Figure 1 e).
Observations: We find that ATA-GP-UCB outperforms TGP-UCB uniformly over all experiments, which is consistent with our theoretical results. We also see that the performance of ATA-GP-UCB under the Nyström approximation is no worse than that under the QFF approximation. Moreover, the scope of the latter is limited by its dependence on the analytical form of the kernel, whereas the former is data-adaptive and hence well suited for practical purposes.
Effect of truncation: For heavy-tailed rewards, the sub-Gaussian constant $R = \infty$; hence, we exclude GP-UCB from the above experiments. We now demonstrate the effect of truncation on GP-UCB in the following experiment. First, we generate a function $f \in H_k(\mathcal{X})$ and normalize it to $[0,1]$. Then, we simulate rewards as $y(x) = f(x) + \eta$, where $\eta$ takes values in $\{-10, 10\}$, uniformly, for a single random point in $\mathcal{X}$, and is zero everywhere else. We run GP-UCB with $\beta_t = \ln t$ and see that the posterior mean after $T = 10^4$ rounds is not a good estimate of $f$. However, by truncating reward samples that exceed $t^{1/4}$ (the truncation threshold in TGP-UCB when $\alpha = 1$) at round $t$, we get an (almost) accurate estimator of $f$. Moreover, the confidence interval around this estimator contains $f$ at every point in $\mathcal{X}$, which in turn ensures good performance. We plot the respective confidence sets averaged over 50 such randomizations of the noise (Figure 1 f).

7 Conclusion

To the best of our knowledge, this is the first work to formulate and solve BO under heavy-tailed observations.
We have demonstrated the failure of existing BO methods and developed (almost) optimal algorithms with rigorous theoretical guarantees, based on kernel approximation techniques, which are easy to implement and perform well in practice. It is worth noting that by using a Bernstein-type concentration bound in each direction of the approximate feature space, we are able to obtain the near-optimal regret scaling for ATA-GP-UCB (Algorithm 2). Instead, if one could derive a Bernstein-type bound for self-normalized processes that depends on the $(1+\alpha)$-th moments of the rewards, then one might not have to resort to feature approximation to get optimal regret. Further, instead of truncating the payoffs, one could also consider building and studying a median-of-means-style estimator [8] in the (approximate) feature space, in the hope of developing an optimal algorithm.

Acknowledgments

The authors are grateful to the anonymous reviewers for their valuable comments. S. R. Chowdhury is supported by the Google India PhD fellowship grant and the Tata Trusts travel grant. A. Gopalan is grateful for support from the DST INSPIRE faculty grant IFA13-ENG-69.

References

[1] Ahmed Alaoui and Michael W Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems, pages 775–783, 2015.
[2] Mohammad Gheshlaghi Azar, Alessandro Lazaric, and Emma Brunskill. Online stochastic optimization under correlated bandit feedback. In ICML, pages 1557–1565, 2014.
[3] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, February 2012.
[4] Salomon Bochner. Lectures on Fourier Integrals. Princeton University Press, 1959.
[5] Ilija Bogunovic, Jonathan Scarlett, Stefanie Jegelka, and Volkan Cevher. Adversarially robust optimization with Gaussian processes.
In Advances in Neural Information Processing Systems, pages 5760–5770, 2018.
[6] Eric Brochu, Vlad M Cora, and Nando De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.
[7] Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. X-armed bandits. Journal of Machine Learning Research, 12(May):1655–1695, 2011.
[8] Sébastien Bubeck, Nicolò Cesa-Bianchi, and Gábor Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013.
[9] Daniele Calandriello, Luigi Carratino, Alessandro Lazaric, Michal Valko, and Lorenzo Rosasco. Gaussian process optimization with adaptive sketching: Scalable and no regret. In Conference on Learning Theory, 2019.
[10] Alexandra Carpentier and Michal Valko. Extreme bandits. In Advances in Neural Information Processing Systems, pages 1089–1097, 2014.
[11] Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning, pages 844–853, 2017.
[12] Petros Drineas and Michael W Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6(Dec):2153–2175, 2005.
[13] Audrey Durand, Odalric-Ambrym Maillard, and Joelle Pineau. Streaming kernel regression with provably adaptive mean, variance, and regularization. Journal of Machine Learning Research, 19(1):650–683, 2018.
[14] R. Garnett, M. A. Osborne, and S. J. Roberts. Bayesian optimization for sensor set selection. In Proceedings of the 9th ACM/IEEE International Conference on Information Processing in Sensor Networks, IPSN '10, pages 209–219, New York, NY, USA, 2010.
ACM.
[15] Javier Gonzalez, Joseph Longworth, David C James, and Neil D Lawrence. Bayesian optimization for synthetic gene design. arXiv preprint arXiv:1505.01627, 2015.
[16] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.
[17] José Miguel Hernández-Lobato, Matthew W Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pages 918–926, 2014.
[18] Francis Begnaud Hildebrand. Introduction to Numerical Analysis. Courier Corporation, 1987.
[19] Daniel Hsu and Sivan Sabato. Heavy-tailed regression with a generalized median-of-means. In International Conference on Machine Learning, pages 37–45, 2014.
[20] Krishna P. Jagannathan, Mihalis G. Markakis, Eytan Modiano, and John N. Tsitsiklis. Throughput optimal scheduling over time-varying channels in the presence of heavy-tailed traffic. IEEE Transactions on Information Theory, 60(5):2896–2909, 2014. doi: 10.1109/TIT.2014.2311125. URL https://doi.org/10.1109/TIT.2014.2311125.
[21] Kirthevasan Kandasamy, Jeff Schneider, and Barnabás Póczos. High dimensional Bayesian optimisation and bandits via additive models. In International Conference on Machine Learning, pages 295–304, 2015.
[22] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 681–690. ACM, 2008.
[23] Tor Lattimore. A scale free algorithm for stochastic bandits with bounded kurtosis. In Advances in Neural Information Processing Systems, pages 1584–1593, 2017.
[24] Andres Munoz Medina and Scott Yang. No-regret algorithms for heavy-tailed linear bandits.
In International Conference on Machine Learning, pages 1642–1650, 2016.
[25] Jonas Močkus. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400–404. Springer, 1975.
[26] Mojmir Mutny and Andreas Krause. Efficient high dimensional Bayesian optimization with additivity and quadrature Fourier features. In Advances in Neural Information Processing Systems, pages 9005–9016, 2018.
[27] Joaquin Quinonero-Candela, Carl Edward Rasmussen, and Christopher KI Williams. Approximation methods for Gaussian process regression. Large-Scale Kernel Machines, pages 203–224, 2007.
[28] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.
[29] Sidney I Resnick. Heavy-Tail Phenomena: Probabilistic and Statistical Modeling. Springer Science & Business Media, 2007.
[30] Paul Rolland, Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. High-dimensional Bayesian optimization via additive models with overlapping groups. arXiv preprint arXiv:1802.07028, 2018.
[31] Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. Lower bounds on regret for noisy Gaussian process bandit optimization. In Conference on Learning Theory, pages 1723–1742, 2017.
[32] Yevgeny Seldin, François Laviolette, Nicolò Cesa-Bianchi, John Shawe-Taylor, and Peter Auer. PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12):7086–7093, 2012.
[33] Rajat Sen, Kirthevasan Kandasamy, and Sanjay Shakkottai. Noisy blackbox optimization using multi-fidelity queries: A tree search approach. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2096–2105, 2019.
[34] Han Shao, Xiaotian Yu, Irwin King, and Michael R Lyu.
Almost optimal algorithms for linear stochastic bandits with heavy-tailed payoffs. In Advances in Neural Information Processing Systems, pages 8420–8429, 2018.
[35] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, pages 1015–1022. Omnipress, 2010.
[36] Bharath Sriperumbudur and Zoltán Szabó. Optimal rates for random Fourier features. In Advances in Neural Information Processing Systems, pages 1144–1152, 2015.
[37] Steven H Strogatz. Exploring complex networks. Nature, 410(6825):268, 2001.
[38] Sattar Vakili, Keqin Liu, and Qing Zhao. Deterministic sequencing of exploration and exploitation for multi-armed bandit problems. IEEE Journal of Selected Topics in Signal Processing, 7(5):759–767, 2013.
[39] Zi Wang and Stefanie Jegelka. Max-value entropy search for efficient Bayesian optimization. In Proceedings of the 34th International Conference on Machine Learning, pages 3627–3635, 2017.
[40] Zi Wang, Bolei Zhou, and Stefanie Jegelka. Optimization as estimation with Gaussian processes in bandit settings. In Artificial Intelligence and Statistics, pages 1022–1031, 2016.
[41] Zi Wang, Clement Gehring, Pushmeet Kohli, and Stefanie Jegelka. Batched large-scale Bayesian optimization in high-dimensional spaces. arXiv preprint arXiv:1706.01445, 2017.
[42] Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In Advances in Neural Information Processing Systems, pages 476–484, 2012.
[43] Xiaotian Yu, Han Shao, Michael R Lyu, and Irwin King. Pure exploration of multi-armed bandits with heavy-tailed payoffs.
In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 937–946, 2018.