{"title": "Adversarially Robust Optimization with Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 5760, "page_last": 5770, "abstract": "In this paper, we consider the problem of Gaussian process (GP) optimization with an added robustness requirement: The returned point may be perturbed by an adversary, and we require the function value to remain as high as possible even after this perturbation. This problem is motivated by settings in which the underlying functions during optimization and implementation stages are different, or when one is interested in finding an entire region of good inputs rather than only a single point. We show that standard GP optimization algorithms do not exhibit the desired robustness properties, and provide a novel confidence-bound based algorithm StableOpt for this purpose. We rigorously establish the required number of samples for StableOpt to find a near-optimal point, and we complement this guarantee with an algorithm-independent lower bound. We experimentally demonstrate several potential applications of interest using real-world data sets, and we show that StableOpt consistently succeeds in finding a stable maximizer where several baseline methods fail.", "full_text": "Adversarially Robust Optimization\n\nwith Gaussian Processes\n\nIlija Bogunovic\nLIONS, EPFL\n\nilija.bogunovic@epfl.ch\n\nJonathan Scarlett\n\nNational University of Singapore\nscarlett@comp.nus.edu.sg\n\nStefanie Jegelka\n\nMIT CSAIL\n\nstefje@mit.edu\n\nVolkan Cevher\nLIONS, EPFL\n\nvolkan.cevher@epfl.ch\n\nAbstract\n\nIn this paper, we consider the problem of Gaussian process (GP) optimization\nwith an added robustness requirement: The returned point may be perturbed by\nan adversary, and we require the function value to remain as high as possible\neven after this perturbation. 
This problem is motivated by settings in which the\nunderlying functions during optimization and implementation stages are different,\nor when one is interested in \ufb01nding an entire region of good inputs rather than only\na single point. We show that standard GP optimization algorithms do not exhibit\nthe desired robustness properties, and provide a novel con\ufb01dence-bound based\nalgorithm STABLEOPT for this purpose. We rigorously establish the required num-\nber of samples for STABLEOPT to \ufb01nd a near-optimal point, and we complement\nthis guarantee with an algorithm-independent lower bound. We experimentally\ndemonstrate several potential applications of interest using real-world data sets,\nand we show that STABLEOPT consistently succeeds in \ufb01nding a stable maximizer\nwhere several baseline methods fail.\n\n1\n\nIntroduction\n\nGaussian processes (GP) provide a powerful means for sequentially optimizing a black-box function\nf that is costly to evaluate and for which noisy point evaluations are available. Since its introduction,\nthis approach has successfully been applied to numerous applications, including robotics [21],\nhyperparameter tuning [30], recommender systems [34], environmental monitoring [31], and more.\nIn many such applications, one is faced with various forms of uncertainty that are not accounted for\nby standard algorithms. 
In robotics, the optimization is often performed via simulations, creating a\nmismatch between the assumed function and the true one; in hyperparameter tuning, the function is\ntypically similarly mismatched due to limited training data; in recommendation systems and several\nother applications, the underlying function is inherently time-varying, so the returned solution may\nbecome increasingly stale over time; the list goes on.\nIn this paper, we address these considerations by studying the GP optimization problem with an\nadditional requirement of adversarial robustness: The returned point may be perturbed by an\nadversary, and we require the function value to remain as high as possible even after this perturbation.\nThis problem is of interest not only for attaining improved robustness to uncertainty, but also for\nsettings where one seeks a region of good points rather than a single point, and for other related\nmax-min optimization settings (see Section 4 for further discussion).\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fRelated work. Numerous algorithms have been developed for GP optimization in recent\nyears [7, 16, 17, 26, 28, 31, 35]. Beyond the standard setting, several important extensions have\nbeen considered, including batch sampling [11, 12, 14], contextual and time-varying settings [6, 20],\nsafety requirements [33], and high dimensional settings [18, 25, 36], just to name a few.\nVarious forms of robustness in GP optimization have been considered previously. A prominent\nexample is that of outliers [22], in which certain function values are highly unreliable; however, this\nis a separate issue from that of the present paper, since in [22] the returned point does not undergo\nany perturbation. Another related recent work is [2], which assumes that the sampled points (rather\nthan the returned one) are subject to uncertainty. 
In addition to this difference, the uncertainty in [2]\nis random rather than adversarial, which is complementary but distinct from our work. The same is\ntrue of a setting called unscented Bayesian optimization in [23]. Moreover, no theoretical results are\ngiven in [2, 23]. In [8], a robust form of batch optimization is considered, but with yet another form\nof robustness, namely, some experiments in the batch may fail to produce an outcome. Level-set\nestimation [7, 15] is another approach to \ufb01nding regions of good points rather than a single point.\nOur problem formulation is also related to other works on non-convex robust optimization, par-\nticularly those of Bertsimas et al. [3, 4]. In these works, a stable design x is sought that solves\nminx2D max2U f (x + ). Here, resides in some uncertainty set U, and represents the perturba-\ntion against which the design x needs to be protected. Related problems have also recently been\nconsidered in the context of adversarial training (e.g., [29]). Compared to these works, our work\nbears the crucial difference that the objective function is unknown, and we can only learn about it\nthrough noisy point evaluations (i.e. bandit feedback).\nOther works, such as [5, 9, 19, 32, 37], have considered robust optimization problems of the following\nform: For a given set of objectives {f1, . . . , fm} \ufb01nd x achieving maxx2D mini=1,...,m fi(x). We\ndiscuss variations of our algorithm for this type of formulation in Section 4.\nContributions. We introduce a variant of GP optimization in which the returned solution is required\nto exhibit stability/robustness to an adversarial perturbation. We demonstrate the failures of standard\nalgorithms, and introduce a new algorithm STABLEOPT that overcomes these limitations. 
We\nprovide a novel theoretical analysis characterizing the number of samples required for STABLEOPT to\nattain a near-optimal robust solution, and we complement this with an algorithm-independent lower\nbound. We provide several variations of our max-min optimization framework and theory, including\nconnections and comparisons to previous works. Finally, we experimentally demonstrate a variety of\npotential applications of interest using real-world data sets, and we show that STABLEOPT consistently\nsucceeds in \ufb01nding a stable maximizer where several baseline methods fail.\n\n2 Problem Setup\n\nModel. Let f be an unknown reward function over a domain D \u2713 Rp for some dimension p. At\ntime t, we query f at a single point xt 2 D and observe a noisy sample yt = f (xt) + zt, where\nzt \u21e0N (0, 2). After T rounds, a recommended point x(T ) is returned. In contrast with the standard\ngoal of making f (x(T )) as high as possible, we seek to \ufb01nd a point such that f remains high even\nafter an adversarial perturbation; a formal description is given below.\nWe assume that D is endowed with a kernel function k(\u00b7,\u00b7), and f has a bounded norm in the\ncorresponding Reproducing Kernel Hilbert Space (RKHS) Hk(D). Speci\ufb01cally, we assume that\nf 2F k(B), where\n(1)\nand kfkk is the RKHS norm in Hk(D). It is well-known that this assumption permits the construction\nof con\ufb01dence bounds via Gaussian process (GP) methods; see Lemma 1 below for a precise statement.\nWe assume that the kernel is normalized to satisfy k(x, x) = 1 for all x 2 D. 
Two commonly-considered kernels are the squared exponential (SE) and Matérn kernels:

$$k_{\mathrm{SE}}(x, x') = \exp\Big(-\frac{\|x - x'\|^2}{2l^2}\Big), \qquad (2)$$

$$k_{\mathrm{Mat}}(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \Big(\frac{\sqrt{2\nu}\,\|x - x'\|}{l}\Big)^{\nu} J_{\nu}\Big(\frac{\sqrt{2\nu}\,\|x - x'\|}{l}\Big), \qquad (3)$$

where l denotes the length-scale, ν > 0 is an additional parameter that dictates the smoothness, and J_ν(·) and Γ(·) denote the modified Bessel function and the gamma function, respectively [24]. (Here, the function class in (1) is F_k(B) = {f ∈ H_k(D) : ‖f‖_k ≤ B}.)

Figure 1: (Left) A function f and its maximizer x*₀. (Middle) For ε = 0.06 and d(x, x') = |x − x'|, the decision that corresponds to the local "wider" maximum of f is the optimal ε-stable decision. (Right) GP-UCB selects a point that nearly maximizes f, but is suboptimal in the ε-stable sense.

Given a sequence of decisions {x₁, · · · , x_t} and their noisy observations {y₁, · · · , y_t}, the posterior distribution under a GP(0, k(x, x')) prior is also Gaussian, with the following mean and variance:

$$\mu_t(x) = \mathbf{k}_t(x)^T \big(\mathbf{K}_t + \sigma^2 \mathbf{I}\big)^{-1} \mathbf{y}_t, \qquad (4)$$

$$\sigma_t^2(x) = k(x, x) - \mathbf{k}_t(x)^T \big(\mathbf{K}_t + \sigma^2 \mathbf{I}\big)^{-1} \mathbf{k}_t(x), \qquad (5)$$

where $\mathbf{k}_t(x) = [k(x_i, x)]_{i=1}^{t}$ and $\mathbf{K}_t = [k(x_t, x_{t'})]_{t,t'}$ is the kernel matrix.

Optimization goal. Let d(x, x') be a function mapping D × D → R, and let ε be a constant known as the stability parameter. For each point x ∈ D, we define a set

$$\Delta_\epsilon(x) = \big\{x' - x \,:\, x' \in D \text{ and } d(x, x') \le \epsilon\big\}. \qquad (6)$$

One can interpret this as the set of perturbations of x such that the newly obtained point x' is within a "distance" ε of x. While we refer to d(·,·) as the distance function throughout the paper, we allow it to be a general function, and not necessarily a distance in the mathematical sense.
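The posterior updates (4)–(5) amount to a few lines of linear algebra. The following minimal sketch (the function names and parameter choices are ours, not the paper's) implements them for a normalized SE kernel:

```python
import numpy as np

def se_kernel(A, B, l=1.0):
    # SE kernel from Eq. (2): k(x, x') = exp(-||x - x'||^2 / (2 l^2)), so k(x, x) = 1.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * l ** 2))

def gp_posterior(X, y, Xq, sigma=0.1, l=1.0):
    # Posterior mean and variance from Eqs. (4)-(5):
    #   mu_t(x)      = k_t(x)^T (K_t + sigma^2 I)^{-1} y_t
    #   sigma_t^2(x) = k(x, x) - k_t(x)^T (K_t + sigma^2 I)^{-1} k_t(x)
    K = se_kernel(X, X, l) + sigma ** 2 * np.eye(len(X))
    kq = se_kernel(Xq, X, l)                       # each row is k_t(x) for one query x
    mu = kq @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(kq * np.linalg.solve(K, kq.T).T, axis=1)
    return mu, np.maximum(var, 0.0)                # clip tiny negative values from round-off
```

At an observed point the posterior mean approaches the observation and the variance collapses, while far from all observations the posterior reverts to the prior (mean 0, variance k(x, x) = 1).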
As we exemplify in Section 5, the parameter ε might be naturally specified as part of the application, or might be better treated as a parameter that can be tuned for the purpose of the overall learning goal.
We define an ε-stable optimal input to be any x*_ε satisfying

$$x^*_\epsilon \in \arg\max_{x \in D} \; \min_{\delta \in \Delta_\epsilon(x)} f(x + \delta). \qquad (7)$$

Our goal is to report a point x(T) that is stable in the sense of having low ε-regret, defined as

$$r_\epsilon(x) = \min_{\delta \in \Delta_\epsilon(x^*_\epsilon)} f(x^*_\epsilon + \delta) \;-\; \min_{\delta \in \Delta_\epsilon(x)} f(x + \delta). \qquad (8)$$

Note that once r_ε(x) ≤ η for some accuracy value η ≥ 0, it follows that

$$\min_{\delta \in \Delta_\epsilon(x)} f(x + \delta) \;\ge\; \min_{\delta \in \Delta_\epsilon(x^*_\epsilon)} f(x^*_\epsilon + \delta) - \eta. \qquad (9)$$

We assume that d(·,·) and ε are known, i.e., they are specified as part of the optimization formulation. As a running example, we consider the case that d(x, x') = ‖x − x'‖ for some norm ‖·‖ (e.g., the ℓ₂-norm), in which case achieving low ε-regret amounts to favoring broad peaks instead of narrow ones, particularly for higher ε; see Figure 1 for an illustration. In Section 4, we discuss how our framework also captures a variety of other max-min optimization settings of interest.
Failure of classical methods. Various algorithms have been developed for achieving small regret in the standard GP optimization problem. A prominent example is GP-UCB, which chooses

$$x_t \in \arg\max_{x \in D} \mathrm{ucb}_{t-1}(x), \qquad (10)$$

where ucb_{t−1}(x) := μ_{t−1}(x) + β_t^{1/2} σ_{t−1}(x). This algorithm is guaranteed to achieve sublinear cumulative regret with high probability [31], for a suitably chosen β_t. While this is useful when

Algorithm 1 The STABLEOPT algorithm
Input: Domain D, GP prior (μ₀, σ₀, k), parameters {β_t}_{t≥1}, stability ε, distance function d(·,·)
1: for t = 1, 2, . . .
, T do
2:   Set
$$\tilde{x}_t = \arg\max_{x \in D} \; \min_{\delta \in \Delta_\epsilon(x)} \mathrm{ucb}_{t-1}(x + \delta). \qquad (13)$$
3:   Set $\delta_t = \arg\min_{\delta \in \Delta_\epsilon(\tilde{x}_t)} \mathrm{lcb}_{t-1}(\tilde{x}_t + \delta)$
4:   Sample $\tilde{x}_t + \delta_t$, and observe $y_t = f(\tilde{x}_t + \delta_t) + z_t$
5:   Update $\mu_t$, $\sigma_t$, $\mathrm{ucb}_t$ and $\mathrm{lcb}_t$ according to (5) and (12), by including $\{(\tilde{x}_t + \delta_t, y_t)\}$
6: end for

x*_ε = x*₀,¹ in general for a given fixed ε ≠ 0, these two decisions may not coincide, and hence, min_{δ∈Δ_ε(x*₀)} f(x*₀ + δ) can be significantly smaller than min_{δ∈Δ_ε(x*_ε)} f(x*_ε + δ).
A visual example is given in Figure 1 (Right), where the selected point of GP-UCB for t = 20 is shown. This point nearly maximizes f, but it is strictly suboptimal in the ε-stable sense. The same limitation applies to other GP optimization strategies (e.g., [7, 16, 17, 26, 28, 35]) whose goal is to identify the global non-robust maximum x*₀. In Section 5, we will see that more advanced baseline strategies also perform poorly when applied to our problem.

3 Proposed Algorithm and Theory

Our proposed algorithm, STABLEOPT, is described in Algorithm 1, and makes use of the following confidence bounds depending on an exploration parameter β_t (cf. Lemma 1 below):

$$\mathrm{ucb}_{t-1}(x) := \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x), \qquad (11)$$
$$\mathrm{lcb}_{t-1}(x) := \mu_{t-1}(x) - \beta_t^{1/2} \sigma_{t-1}(x). \qquad (12)$$

The point x̃_t defined in (13) is the one having the highest "stable" upper confidence bound. However, the queried point is not x̃_t, but instead x̃_t + δ_t, where δ_t ∈ Δ_ε(x̃_t) is chosen to minimize the lower confidence bound. As a result, the algorithm is based on two distinct principles: (i) optimism in the face of uncertainty when it comes to selecting x̃_t; (ii) pessimism in the face of uncertainty when it comes to anticipating the perturbation of x̃_t.
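On a finite one-dimensional grid, these two principles reduce to a handful of array operations. The following is a minimal, noise-free sketch of Algorithm 1 (all names, the grid, and the test settings are ours and not the authors' implementation; the reported point uses the highest stable lower confidence bound, as in the experiments):

```python
import numpy as np

def se_kernel(A, B, l=0.1):
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2.0 * l ** 2))

def posterior(Xs, ys, X, sigma=0.05, l=0.1):
    # GP posterior mean/std over the grid X from samples (Xs, ys), per Eqs. (4)-(5).
    Xs, ys = np.asarray(Xs), np.asarray(ys)
    K = se_kernel(Xs, Xs, l) + sigma ** 2 * np.eye(len(Xs))
    kq = se_kernel(X, Xs, l)
    mu = kq @ np.linalg.solve(K, ys)
    var = 1.0 - np.sum(kq * np.linalg.solve(K, kq.T).T, axis=1)
    return mu, np.sqrt(np.maximum(var, 0.0))

def stableopt(f, X, eps, T=40, beta_sqrt=2.0):
    # Sketch of Algorithm 1 on a finite grid with d(x, x') = |x - x'|,
    # using noise-free observations of f.
    ball = np.abs(X[:, None] - X[None, :]) <= eps   # ball[i, j]: X[j] lies in the eps-ball of X[i]
    Xs, ys, cand = [X[0]], [float(f(X[0]))], []
    for _ in range(T):
        mu, sd = posterior(Xs, ys, X)
        ucb, lcb = mu + beta_sqrt * sd, mu - beta_sqrt * sd
        i = int(np.argmax([ucb[b].min() for b in ball]))      # optimism: x_t maximizes min ucb
        j = np.flatnonzero(ball[i])[np.argmin(lcb[ball[i]])]  # pessimism: lcb-minimizing shift
        Xs.append(X[j]); ys.append(float(f(X[j])))
        cand.append(i)
    mu, sd = posterior(Xs, ys, X)                             # report the visited candidate
    lcb = mu - beta_sqrt * sd                                 # with the highest stable lcb
    stable_lcb = [lcb[b].min() for b in ball]
    return X[max(cand, key=lambda i: stable_lcb[i])]
```

On a function with a tall narrow peak and a lower but wide peak, this selection rule steers the report toward the wide peak, mirroring Figure 1.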
The first of these is inherent to existing algorithms such as GP-UCB [31], whereas the second is unique to the adversarially robust GP optimization problem. An example illustration of STABLEOPT's execution is given in the supplementary material.
We have left the final reported point x(T) unspecified in Algorithm 1, as there are numerous reasonable choices. The simplest choice is to simply return x(T) = x̃_T, but in our theory and experiments, we will focus on x(T) equaling the point in {x̃₁, . . . , x̃_T} with the highest lower confidence bound.
Finding an exact solution to the optimization of the acquisition function in (13) can be challenging in practice. When D is continuous, a natural approach is to find an approximate solution using an efficient local search algorithm for robust optimization with a fully known objective function, such as that of [4].

3.1 Upper bound on ε-regret

Our analysis makes use of the maximum information gain under t noisy measurements:

$$\gamma_t = \max_{x_1, \cdots, x_t} \frac{1}{2} \log\det\big(\mathbf{I}_t + \sigma^{-2} \mathbf{K}_t\big), \qquad (14)$$

which has been used in numerous theoretical works on GP optimization following [31].
STABLEOPT depends on the exploration parameter β_t, which determines the width of the confidence bounds. In our main result, we set β_t as in [10] and make use of the following.

¹In this discussion, we take d(x, x') = ‖x − x'‖₂, so that ε = 0 recovers the standard non-stable regret [31].

Lemma 1. [10] Fix f ∈ F_k(B), and consider the sampling model y_t = f(x_t) + z_t with z_t ∼ N(0, σ²), with independence between times. Under the choice $\beta_t = \big(B + \sigma\sqrt{2(\gamma_{t-1} + \log(e/\xi))}\big)^2$, the following holds with probability at least 1 − ξ:

$$\mathrm{lcb}_{t-1}(x) \le f(x) \le \mathrm{ucb}_{t-1}(x), \quad \forall x \in D, \; \forall t \ge 1. \qquad (15)$$

The following theorem bounds the performance of STABLEOPT under a suitable choice of the recommended point x(T).
The proof is given in the supplementary material.

Theorem 1. (Upper Bound) Fix ε > 0, η > 0, B > 0, T ∈ Z, ξ ∈ (0, 1), and a distance function d(x, x'), and suppose that

$$\frac{T}{\beta_T \gamma_T} \ge \frac{C_1}{\eta^2}, \qquad (16)$$

where C₁ = 8/log(1 + σ⁻²). For any f ∈ F_k(B), STABLEOPT with β_t set as in Lemma 1 achieves r_ε(x(T)) ≤ η after T rounds with probability at least 1 − ξ, where

$$x^{(T)} = \tilde{x}_{t^*}, \qquad t^* = \arg\max_{t=1,\ldots,T} \; \min_{\delta \in \Delta_\epsilon(\tilde{x}_t)} \mathrm{lcb}_{t-1}(\tilde{x}_t + \delta). \qquad (17)$$

This result holds for general kernels, and for both finite and continuous D. Our analysis bounds function values according to the confidence bounds in Lemma 1 analogously to GP-UCB [31], but also addresses the non-trivial challenge of characterizing the perturbations δ_t. While we focused on the non-Bayesian RKHS setting, the proof can easily be adapted to handle the Bayesian optimization (BO) setting in which f ∼ GP(0, k); see Section 4 for further discussion.
Theorem 1 can be made more explicit by substituting bounds on γ_T; in particular, $\gamma_T = O\big((\log T)^{p+1}\big)$ for the SE kernel, and $\gamma_T = O\big(T^{\frac{p(p+1)}{2\nu + p(p+1)}} \log T\big)$ for the Matérn-ν kernel [31]. The former yields $T = O^*\big(\frac{1}{\eta^2}\big(\log\frac{1}{\eta}\big)^{p}\big)$ in Theorem 1 for constant B, σ², and ε (where O*(·) hides dimension-independent log factors), which we will shortly see nearly matches an algorithm-independent lower bound.

3.2 Lower bound on ε-regret

Establishing lower bounds under general kernels and input domains is an open problem even in the non-robust setting. Accordingly, the following theorem focuses on a more specific setting than the upper bound: We let the input domain be [0, 1]^p for some dimension p, and we focus on the SE and Matérn kernels.
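To make the role of γ_T in (16) concrete, the quantity inside the maximum in (14) can be evaluated for any fixed set of sampled points (a small sketch; the function name is ours):

```python
import numpy as np

def info_gain(K, sigma):
    # (1/2) log det(I + sigma^{-2} K_t) for a fixed set of t points, as in Eq. (14);
    # gamma_t is this quantity maximized over all size-t sets of points.
    t = K.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(t) + K / sigma ** 2)
    return 0.5 * logdet
```

For t mutually independent points (K = I) the gain is (t/2) log(1 + σ⁻²), growing linearly in t, whereas highly correlated points contribute much less; smooth kernels such as the SE kernel keep γ_T polylogarithmic in T, which is what makes the sample-complexity condition (16) satisfiable for large T.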
In addition, we only consider the case that d(x, x') = ‖x − x'‖₂, though extensions to other norms (e.g., ℓ₁ or ℓ∞) follow immediately from the proof.

Theorem 2. (Lower Bound) Let D = [0, 1]^p for some dimension p, and set d(x, x') = ‖x − x'‖₂. Fix ε ∈ (0, 1/2), η ∈ (0, 1/2), B > 0, and T ∈ Z. Suppose there exists an algorithm that, for any f ∈ F_k(B), reports a point x(T) achieving ε-regret r_ε(x(T)) ≤ η after T rounds with probability at least 1 − ξ. Then, provided that η/B and ξ are sufficiently small, we have the following:
1. For k = k_SE, it is necessary that $T = \Omega\big(\frac{\sigma^2}{\eta^2}\big(\log\frac{B}{\eta}\big)^{p/2}\big)$.
2. For k = k_Matérn, it is necessary that $T = \Omega\big(\frac{\sigma^2}{\eta^2}\big(\frac{B}{\eta}\big)^{p/\nu}\big)$.
Here we assume that the stability parameter ε, dimension p, target probability ξ, and kernel parameters l, ν are fixed (i.e., not varying as a function of the parameters T, η, and B).

The proof is based on constructing a finite subset of "difficult" functions in F_k(B) and applying lower bounding techniques from the multi-armed bandit literature, also making use of several auxiliary results from the non-robust setting [27]. More specifically, the functions in the restricted class consist of narrow negative "valleys" that the adversary can perturb the reported point into, but that are hard to identify until a large number of samples have been taken.

For constant σ² and B, the condition for the SE kernel simplifies to $T = \Omega\big(\frac{1}{\eta^2}\big(\log\frac{1}{\eta}\big)^{p/2}\big)$, thus nearly matching the upper bound $T = O^*\big(\frac{1}{\eta^2}\big(\log\frac{1}{\eta}\big)^{p}\big)$ of STABLEOPT established above.
In the case of the Matérn kernel, more significant gaps remain between the upper and lower bounds; however, similar gaps remain even in the standard (non-robust) setting [27].

4 Variations of STABLEOPT

While the above problem formulation seeks robustness within an ε-ball corresponding to the distance function d(·,·), our algorithm and theory apply to a variety of seemingly distinct settings. We outline a few such settings here; in the supplementary material, we give details of their derivations.
Robust Bayesian optimization. Algorithm 1 and Theorem 1 extend readily to the Bayesian setting in which f ∼ GP(0, k(x, x')). In particular, since the proof of Theorem 1 is based on confidence bounds, the only change required is selecting β_t to be that used for the Bayesian setting in [31]. As a result, our framework also captures the novel problem of adversarially robust Bayesian optimization. All of the variations outlined below similarly apply to both the Bayesian and non-Bayesian settings.
Robustness to unknown parameters. Consider the scenario where an unknown function f : D × Θ → R has a bounded RKHS norm under some composite kernel k((x, θ), (x', θ')). Some special cases include k((x, θ), (x', θ')) = k(x, x') + k(θ, θ') and k((x, θ), (x', θ')) = k(x, x')k(θ, θ') [20].
The posterior mean \u00b5t(x, \u2713) and variance 2\nt (x, \u2713) conditioned on the previous observations\n(x1, \u27131, y1), ..., (xt1, \u2713t1, yt1) are computed analogously to (5) [20].\nA robust optimization formulation in this setting is to seek x that solves\n\n(18)\nThat is, we seek to \ufb01nd a con\ufb01guration x that is robust against any possible parameter vector \u2713 2 \u21e5.\nPotential applications of this setup include the following:\n\nmax\nx2D\n\nmin\n\u27132\u21e5\n\nf (x, \u2713).\n\n\u2022 In areas such a robotics, we may have numerous parameters to tune (given by x and \u2713 collec-\ntively), but when it comes to implementation, some of them (i.e., only \u2713) become out of our\ncontrol. Hence, we need to be robust against whatever values they may take.\n\n\u2022 We wish to tune hyperparameters in order to make an algorithm work simultaneously for\na number of distinct data types that bear some similarities/correlations. The data types are\nrepresented by \u2713, and we can choose the data type to our liking during the optimization stage.\nSTABLEOPT can be used to solve (18); we maintain \u2713t instead of t, and modify the main steps to\n(19)\n\nxt 2 arg max\nx2D\n\u2713t 2 arg min\n\u27132\u21e5\n\nucbt1(x, \u2713),\n\nmin\n\u27132\u21e5\nlcbt1(xt, \u2713).\n\n(20)\n\nAt each time step, STABLEOPT receives a noisy observation yt = f (xt, \u2713t) + zt, which is used\nwith (xt, \u2713t) for computing the posterior. As explained in the supplementary material, the guarantee\nr\u270f(x(T )) \uf8ff \u2318 in Theorem 1 is replaced by min\u27132\u21e5 f (x(T ), \u2713) maxx2D min\u27132\u21e5 f (x, \u2713) \u2318.\nRobust estimation. Continuing with the consideration of a composite kernel on (x, \u2713), we consider\nthe following practical problem variant proposed in [4]. Let \u00af\u2713 2 \u21e5 be an estimate of the true problem\ncoef\ufb01cient \u2713\u21e4 2 \u21e5. 
Since, \u00af\u2713 is an estimate, the true coef\ufb01cient satis\ufb01es \u2713\u21e4 = \u00af\u2713 + \u2713, where \u2713\nrepresents uncertainty in \u00af\u2713. Often, practitioners solve maxx2D f (x, \u00af\u2713) and ignore the uncertainty.\nAs a more sophisticated approach, we let \u270f(\u00af\u2713) =\u2713 \u00af\u2713 : \u2713 2 \u21e5 and d(\u00af\u2713, \u2713) \uf8ff \u270f , and consider\nthe following robust problem formulation:\n\nFor the given estimate \u00af\u2713, the main steps of STABLEOPT in this setting are\n\nmax\nx2D\n\nmin\n\n\u27132\u270f(\u00af\u2713)\n\nf (x, \u00af\u2713 + \u2713).\n\nxt 2 arg max\nx2D\n\nmin\n\n\u27132\u270f(\u00af\u2713)\n\n\u2713,t 2 arg min\n\u27132\u270f(\u00af\u2713)\n\nucbt1(x, \u00af\u2713 + \u2713),\n\nlcbt1(xt, \u00af\u2713 + \u2713),\n\n(21)\n\n(22)\n\n(23)\n\nand the noisy observations take the form yt = f (xt, \u00af\u2713 + \u2713,t) + zt. The guarantee r\u270f(x(T )) \uf8ff \u2318 in\nTheorem 1 is replaced by min\u27132\u270f(\u00af\u2713) f (x(T ), \u00af\u2713 + \u2713) maxx2D min\u27132\u270f(\u00af\u2713) f (x, \u00af\u2713 + \u2713) \u2318.\nRobust group identi\ufb01cation. In some applications, it is natural to partition D into disjoint subsets,\nand search for the subset with the highest worst-case function value (see Section 5 for a movie\n\n6\n\n\frecommendation example). Letting G = {G1, . . . , Gk} denote the groups that partition the input\nspace, the robust optimization problem is given by\n\nand the algorithm reports a group G(T ). The main steps of STABLEOPT are given by\n\nmax\nG2G\n\nmin\nx2G\n\nf (x),\n\nmin\nx2G\n\nucbt1(x),\n\nG2G\n\nGt 2 arg max\nxt 2 arg min\nx2Gt\n\nlcbt1(x),\n\n(24)\n\n(25)\n\n(26)\n\nand the observations are of the form yt = f (xt) + zt as usual. 
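For a finite partition, one round of the group-selection rules (25)–(26) is just a pair of nested extrema over the per-point confidence bounds; the same max–min/arg-min pattern underlies the parameter variants (19)–(20) and (22)–(23). A minimal sketch (the function name and data layout are our own):

```python
import numpy as np

def stableopt_group_step(groups, ucb, lcb):
    # One selection round of the group variant, Eqs. (25)-(26): choose the group
    # with the highest worst-case ucb, then sample that group's lcb-minimizer.
    # `groups` is a list of index arrays partitioning the domain; ucb/lcb are
    # per-point confidence-bound arrays.
    g = max(range(len(groups)), key=lambda i: ucb[groups[i]].min())
    x = groups[g][int(np.argmin(lcb[groups[g]]))]
    return g, x
```

Note that the optimistic choice is over groups while the pessimistic choice is over points within the chosen group, matching the optimism/pessimism split of Algorithm 1.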
The guarantee r\u270f(x(T )) \uf8ff \u2318 in\nTheorem 1 is replaced by minx2G(T ) f (x) maxG2G minx2G f (x) \u2318.\nThe preceding variations of STABLEOPT, as well as their resulting variations of Theorem 1, follow\nby substituting certain (unconventional) choices of d(\u00b7,\u00b7) and \u270f into Algorithm 1 and Theorem 1,\nwith (x, \u2713) in place of x where appropriate. The details are given in the supplementary material.\n\n5 Experiments\n\nIn this section, we experimentally validate the performance of STABLEOPT by comparing against\nseveral baselines. Each algorithm that we consider may distinguish between the sampled point (i.e.,\nthe point that produces the noisy observation yt) and the reported point (i.e., the point believed to\nbe near-optimal in terms of \u270f-stability). For STABLEOPT, as described in Algorithm 1, the sampled\npoint is \u02dcxt + t, and the reported point after time t is the one in {\u02dcx1, . . . , \u02dcxt} with the highest value\nof min2\u270f(\u02dcxt) lcbt(\u02dcxt + ).2 In addition, we consider the following baselines:\n\nexisting methods that search for the non-robust global maximum.\n\n\u2022 GP-UCB (see (10)). We consider GP-UCB to be a good representative of the wide range of\n\u2022 MAXIMIN-GP-UCB. We consider a natural generalization of GP-UCB in which, at each time\n\nstep, the sampled and reported point are both given by\n\nxt = arg max\n\nx2D\n\nmin\n\n2\u270f(x)\n\nucbt1(x + ).\n\n(27)\n\n\u2022 STABLE-GP-RANDOM. The sampling point xt at every time step is chosen uniformly at\nrandom, while the reported point at time t is chosen to be the point among the sampled points\n{x1, . . . , xt} according to the same rule as the one used for STABLEOPT.\nis again chosen in the same way as in STABLEOPT.\n\n\u2022 STABLE-GP-UCB. The sampled point is given by the GP-UCB rule, while the reported point\n\nAs observed in existing works (e.g., [7, 31]), the theoretical choice of t is overly conservative. 
We therefore adopt a constant value of β_t^{1/2} = 2.0 in each of the above methods, which we found to provide a suitable exploration/exploitation trade-off for each of the above algorithms.
Given a reported point x(t) at time t, the performance metric is the ε-regret r_ε(x(t)) given in (8). Two observations are in order: (i) All the baselines are heuristic approaches for our problem, and they do not have guarantees in terms of near-optimal stability; (ii) We do not compare against other standard GP optimization methods, as their performance is comparable to that of GP-UCB; in particular, they suffer from exactly the same pitfalls described at the end of Section 2.
Synthetic function. We consider the synthetic function from [4] (see Figure 2a), given by

$$f_{\text{poly}}(x, y) = -2x^6 + 12.2x^5 - 21.2x^4 - 6.2x + 6.4x^3 + 4.7x^2 - y^6 + 11y^5 - 43.3y^4 + 10y + 74.8y^3 - 56.9y^2 + 4.1xy + 0.1x^2y^2 - 0.4xy^2 - 0.4x^2y. \qquad (28)$$

²This is slightly different from Theorem 1, which uses the confidence bound lcb_{τ−1} for x_τ instead of adopting the common bound lcb_t. We found the latter to be more suitable when the kernel hyperparameters are updated online, whereas Theorem 1 assumes a known kernel. Theorem 1 can be adapted to use lcb_t alone by intersecting the confidence bounds at each time instant so that they are monotonically shrinking [15].

Figure 2: (Left) Synthetic function from [4]. (Middle) Counterpart with worst-case perturbations. (Right) The performance. In this example, STABLEOPT significantly outperforms the baselines.

The decision space is a uniformly spaced grid of points in (−0.95, 3.2) × (−0.45, 4.4) of size 10⁴. There exist multiple local maxima, and the global maximum is at (x*_f, y*_f) = (2.82, 4.0), with f_poly(x*_f, y*_f) = 20.82. Similarly as in [4], given stability parameters δ = (δ_x, δ_y), where
Similarly as in [4], given stability parameters = (x, y), where\nkk2 \uf8ff 0.5, the robust optimization problem is\nmax\n(x,y)2D\n\ngpoly(x, y),\n\n(29)\n\ngpoly(x, y) :=\n\n(x,y)20.5(x,y)\n\nfpoly(x x, y y).\n\n(30)\nA plot of gpoly is shown in Figure 2b. The global maximum is attained at (x\u21e4g, y\u21e4g) = (0.195, 0.284)\nand gpoly(x\u21e4g, y\u21e4g) = 4.33, and the inputs maximizing f yield gpoly(x\u21e4f , y\u21e4f ) = 22.34.\nWe initialize the above algorithms by selecting 10 uniformly random inputs (x, y), keeping those\npoints the same for each algorithm. The kernel adopted is a squared exponential ARD kernel. We\nrandomly sample 500 points whose function value is above 15.0 to learn the GP hyperparameters\nvia maximum likelihood, and then run the algorithms with these hyperparameters. The observation\nnoise standard deviation is set to 0.1, and the number of sampling rounds is T = 100. We repeat\nthe experiment 100 times and show the average performance in Figure 2c. We observe that STA-\nBLEOPT signi\ufb01cantly outperforms the baselines in this experiment. In the later rounds, the baselines\nreport points that are close to the global optimizer, which is suboptimal with respect to the \u270f-regret.\nLake data. In the supplementary material, we provide an analogous experiment to that above using\nchlorophyll concentration data from Lake Z\u00fcrich, with STABLEOPT again performing best.\nRobust robot pushing. We consider the deterministic version of the robot pushing objective\nfrom [35], with publicly available code.3 The goal is to \ufb01nd a good pre-image for pushing an\nobject to a target location. The 3-dimensional function takes as input the robot location (rx, ry) and\npushing duration rt, and outputs f (rx, ry, rt) = 5 dend, where dend is the distance from the pushed\nobject to the target location. 
The domain D is continuous: r_x, r_y ∈ [−5, 5] and r_t ∈ [1, 30].
We consider a twist on this problem in which there is uncertainty regarding the precise target location, so one seeks a set of input parameters that is robust against a number of different potential locations. In the simplest case, the number of such locations is finite, meaning we can model this problem as r ∈ arg max_{r∈D} min_{i∈[m]} f_i(r), where each f_i corresponds to a different target location, and [m] = {1, . . . , m}. This is a special case of (18) with a finite set Θ of cardinality m.
In our experiment, we use m = 2. Hence, our goal is to find an input configuration r that is robust against two different target locations. The first target is uniform over the domain, and the second is uniform over the ℓ₁-ball centered at the first target location with radius r = 2.0. We initialize each algorithm with one random sample from each f_i. We run each method for T = 100 rounds, and for a reported point r_t at time t, we compare the methods in terms of the robust objective min_{i∈[m]} f_i(r_t). We perform a fully Bayesian treatment of the hyperparameters, sampling every 10 rounds as in [17, 35]. We average over 30 random pairs of {f₁, f₂} and report the results in Figure 3. STABLEOPT noticeably outperforms its competitors except in some of the very early rounds. We note that the apparent discontinuities in certain curves are a result of the hyperparameter re-estimation.

³https://github.com/zi-w/Max-value-Entropy-Search

Figure 3: Robust robot pushing experiment (Left) and MovieLens-100K experiment (Right)

Group movie recommendation. Our goal in this task is to recommend a group of movies to a user such that every movie in the group is to their liking.
We use the MovieLens-100K dataset, which consists of 1682 movies and 943 users. The data takes the form of an incomplete matrix R of ratings, where Ri,j is the rating of movie i given by user j. To impute the missing rating values, we apply non-negative matrix factorization with p = 15 latent factors. This produces a feature vector for each movie mi ∈ Rp and user uj ∈ Rp. We use 10% of the user data for training, in which we fit a Gaussian distribution P(u) = N(u | µ, Σ). For a given user uj in the test set, P(u) is considered to be a prior, and the objective is given by fj(mi) = mi^T uj, corresponding to a GP with a linear kernel. We cluster the movie feature vectors into k = 80 groups, i.e., G = {G1, . . . , Gk}, via the k-means algorithm. Similarly to (26), the robust optimization problem for a given user j is

max_{G∈G} gj(G),   (31)

where gj(G) = min_{mi∈G} fj(mi). That is, for the user with feature vector uj, our goal is to find the group of movies to recommend such that the entire collection of movies is robust with respect to the user's preferences.

In this experiment, we compare STABLEOPT against GP-UCB and MAXIMIN-GP-UCB. We report the ε-regret given by gj(G*) − gj(G(t)), where G* is the maximizer of (31), and G(t) is the reported group after time t. Since we are reporting groups rather than points, the baselines require slight modifications: At time t, GP-UCB selects the movie mt (i.e., asks for the user's rating of it) and reports the group G(t) to which mt belongs. MAXIMIN-GP-UCB reports G(t) ∈ arg max_{G∈G} min_{m∈G} ucb_{t−1}(m) and then selects mt ∈ arg min_{m∈G(t)} ucb_{t−1}(m). Finally, STABLEOPT reports a group in the same way as MAXIMIN-GP-UCB, but selects mt analogously to (26). In Figure 3, we show the average ε-regret, where the average is taken over 500 different test users. In this experiment, the average ε-regret decreases rapidly after only a small number of rounds.
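The group-level objective (31) reduces to a min over per-movie linear-kernel scores. A minimal sketch; the random features `M` and `u` stand in for the NMF factors, and the equal-split grouping is an illustrative assumption in place of the paper's k-means clustering.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins: movie and user feature vectors with p = 15
# latent factors (the paper obtains these via NMF on MovieLens-100K).
p, n_movies, k = 15, 200, 10
M = rng.random((n_movies, p))   # movie features m_i
u = rng.random(p)               # one test user's features u_j

# Linear-kernel objective: f_j(m_i) = m_i^T u_j.
scores = M @ u

# Crude stand-in for the k-means grouping: k equal-sized groups.
groups = np.array_split(np.arange(n_movies), k)

# g_j(G) = min over m_i in G of f_j(m_i); report the group maximizing
# it, as in Eq. (31).
g = np.array([scores[G].min() for G in groups])
best = int(g.argmax())
```

The max-min structure here rewards groups whose worst movie is still well liked, rather than groups containing a single top-scoring movie.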
Among the three methods, STABLEOPT is the only one that finds the optimal movie group.

6 Conclusion

We have introduced and studied a variant of GP optimization in which one requires stability/robustness to an adversarial perturbation. We demonstrated the failures of existing algorithms, and provided a new algorithm STABLEOPT that overcomes these limitations, with rigorous guarantees. We showed that our framework naturally applies to several interesting max-min optimization formulations, and we demonstrated significant improvements over some natural baselines in the experimental examples. An interesting direction for future work is to study the ε-stable optimization formulation in the context of hyperparameter tuning (e.g., for deep neural networks). One might expect that wide function maxima in hyperparameter space provide better generalization than narrow maxima, but establishing this requires further investigation. Similar considerations are an ongoing topic of debate in understanding the parameter space rather than the hyperparameter space, e.g., see [13].

Acknowledgment. This work was partially supported by the Swiss National Science Foundation (SNSF) under grant number 407540_167319, by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 725594 - time-data), by DARPA DSO's Lagrange program under grant FA86501827838, and by an NUS startup grant.

References

[1] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. Technical report, http://www.dklevine.com/archive/refs4462.pdf, 1998.

[2] Justin J. Beland and Prasanth B. Nair. Bayesian optimization under uncertainty.
NIPS BayesOpt 2017 workshop, 2017.

[3] Dimitris Bertsimas, Omid Nohadani, and Kwong Meng Teo. Nonconvex robust optimization for problems with constraints. INFORMS Journal on Computing, 22(1):44–58, 2010.

[4] Dimitris Bertsimas, Omid Nohadani, and Kwong Meng Teo. Robust optimization for unconstrained simulation-based problems. Operations Research, 58(1):161–178, 2010.

[5] Ilija Bogunovic, Slobodan Mitrović, Jonathan Scarlett, and Volkan Cevher. Robust submodular maximization: A non-uniform partitioning approach. In International Conference on Machine Learning (ICML), pages 508–516, 2017.

[6] Ilija Bogunovic, Jonathan Scarlett, and Volkan Cevher. Time-varying Gaussian process bandit optimization. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 314–323, 2016.

[7] Ilija Bogunovic, Jonathan Scarlett, Andreas Krause, and Volkan Cevher. Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation. In Advances in Neural Information Processing Systems (NIPS), pages 1507–1515, 2016.

[8] Ilija Bogunovic, Junyao Zhao, and Volkan Cevher. Robust maximization of non-submodular objectives. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 890–899, 2018.

[9] Robert S Chen, Brendan Lucier, Yaron Singer, and Vasilis Syrgkanis. Robust optimization for non-convex objectives. In Advances in Neural Information Processing Systems (NIPS), pages 4708–4717, 2017.

[10] Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In International Conference on Machine Learning (ICML), pages 844–853, 2017.

[11] Emile Contal, David Buffoni, Alexandre Robicquet, and Nicolas Vayatis. Parallel Gaussian process optimization with upper confidence bound and pure exploration.
In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 225–240. Springer, 2013.

[12] Thomas Desautels, Andreas Krause, and Joel W Burdick. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research, 15(1):3873–3923, 2014.

[13] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In International Conference on Machine Learning (ICML), 2017.

[14] Javier González, Zhenwen Dai, Philipp Hennig, and Neil Lawrence. Batch Bayesian optimization via local penalization. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 648–657, 2016.

[15] Alkis Gotovos, Nathalie Casati, Gregory Hitz, and Andreas Krause. Active learning for level set estimation. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1344–1350, 2013.

[16] Philipp Hennig and Christian J Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13(Jun):1809–1837, 2012.

[17] José Miguel Hernández-Lobato, Matthew W Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems (NIPS), pages 918–926, 2014.

[18] Kirthevasan Kandasamy, Jeff Schneider, and Barnabás Póczos. High dimensional Bayesian optimisation and bandits via additive models. In International Conference on Machine Learning (ICML), pages 295–304, 2015.

[19] Andreas Krause, H Brendan McMahan, Carlos Guestrin, and Anupam Gupta. Robust submodular observation selection. Journal of Machine Learning Research, 9(Dec):2761–2801, 2008.

[20] Andreas Krause and Cheng S Ong.
Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems (NIPS), pages 2447–2455, 2011.

[21] Daniel J Lizotte, Tao Wang, Michael H Bowling, and Dale Schuurmans. Automatic gait optimization with Gaussian process regression. In International Joint Conference on Artificial Intelligence (IJCAI), pages 944–949, 2007.

[22] Ruben Martinez-Cantin, Kevin Tee, and Michael McCourt. Practical Bayesian optimization in the presence of outliers. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.

[23] J. Nogueira, R. Martinez-Cantin, A. Bernardino, and L. Jamone. Unscented Bayesian optimization for safe robot grasping. In IEEE/RSJ Int. Conf. Intel. Robots and Systems (IROS), 2016.

[24] Carl Edward Rasmussen and Christopher KI Williams. Gaussian Processes for Machine Learning, volume 1. MIT Press, Cambridge, 2006.

[25] Paul Rolland, Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. High-dimensional Bayesian optimization via additive models with overlapping groups. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 298–307, 2018.

[26] Binxin Ru, Michael Osborne, and Mark McLeod. Fast information-theoretic Bayesian optimisation. arXiv preprint arXiv:1711.00673, 2017.

[27] Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. Lower bounds on regret for noisy Gaussian process bandit optimization. In Conference on Learning Theory (COLT), 2017.

[28] Shubhanshu Shekhar and Tara Javidi. Gaussian process bandits with adaptive discretization. arXiv preprint arXiv:1712.01447, 2017.

[29] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations (ICLR), 2018.

[30] Jasper Snoek, Hugo Larochelle, and Ryan P Adams.
Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (NIPS), pages 2951–2959, 2012.

[31] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning (ICML), pages 1015–1022, 2010.

[32] Matthew Staib, Bryan Wilder, and Stefanie Jegelka. Distributionally robust submodular maximization. arXiv preprint arXiv:1802.05249, 2018.

[33] Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. Safe exploration for optimization with Gaussian processes. In International Conference on Machine Learning (ICML), pages 997–1005, 2015.

[34] Hastagiri P Vanchinathan, Isidor Nikolic, Fabio De Bona, and Andreas Krause. Explore-exploit in top-N recommender systems via Gaussian processes. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 225–232. ACM, 2014.

[35] Zi Wang and Stefanie Jegelka. Max-value entropy search for efficient Bayesian optimization. In International Conference on Machine Learning (ICML), pages 3627–3635, 2017.

[36] Zi Wang, Chengtao Li, Stefanie Jegelka, and Pushmeet Kohli. Batched high-dimensional Bayesian optimization via structural kernel learning. In International Conference on Machine Learning (ICML), pages 3656–3664, 2017.

[37] Bryan Wilder. Equilibrium computation for zero sum games with submodular structure.
In Conference on Artificial Intelligence (AAAI), 2017.