{"title": "Maximizing acquisition functions for Bayesian optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 9884, "page_last": 9895, "abstract": "Bayesian optimization is a sample-efficient approach to global optimization that relies on theoretically motivated value heuristics (acquisition functions) to guide its search process. Fully maximizing acquisition functions produces the Bayes' decision rule, but this ideal is difficult to achieve since these functions are frequently non-trivial to optimize. This statement is especially true when evaluating queries in parallel, where acquisition functions are routinely non-convex, high-dimensional, and intractable. We first show that acquisition functions estimated via Monte Carlo integration are consistently amenable to gradient-based optimization. Subsequently, we identify a common family of acquisition functions, including EI and UCB, whose characteristics not only facilitate but justify use of greedy approaches for their maximization.", "full_text": "Maximizing acquisition functions\n\nfor Bayesian optimization\n\nJames T. Wilson\u21e4\n\nImperial College London\n\nFrank Hutter\n\nUniversity of Freiburg\n\nMarc Peter Deisenroth\nImperial College London\n\nPROWLER.io\n\nAbstract\n\nBayesian optimization is a sample-ef\ufb01cient approach to global optimization that\nrelies on theoretically motivated value heuristics (acquisition functions) to guide\nits search process. Fully maximizing acquisition functions produces the Bayes\u2019\ndecision rule, but this ideal is dif\ufb01cult to achieve since these functions are fre-\nquently non-trivial to optimize. This statement is especially true when evaluating\nqueries in parallel, where acquisition functions are routinely non-convex, high-\ndimensional, and intractable. We \ufb01rst show that acquisition functions estimated\nvia Monte Carlo integration are consistently amenable to gradient-based optimiza-\ntion. 
Subsequently, we identify a common family of acquisition functions, including EI and UCB, whose properties not only facilitate but justify use of greedy approaches for their maximization.

1 Introduction

Bayesian optimization (BO) is a powerful framework for tackling complicated global optimization problems [32, 40, 44]. Given a black-box function f: 𝒳 → 𝒴, BO seeks to identify a maximizer x* ∈ argmax_{x∈𝒳} f(x) while simultaneously minimizing incurred costs. Recently, these strategies have demonstrated state-of-the-art results on many important, real-world problems ranging from material sciences [17, 57], to robotics [3, 7], to algorithm tuning and configuration [16, 29, 53, 56].

From a high-level perspective, BO can be understood as the application of Bayesian decision theory to optimization problems [11, 14, 45]. One first specifies a belief over possible explanations for f using a probabilistic surrogate model and then combines this belief with an acquisition function L to convey the expected utility for evaluating a set of queries X. In theory, X is chosen according to Bayes' decision rule as L's maximizer by solving an inner optimization problem [19, 42, 59]. In practice, challenges associated with maximizing L greatly impede our ability to live up to this standard. Nevertheless, this inner optimization problem is often treated as a black box unto itself. Failing to address this challenge leads to a systematic departure from BO's premise and, consequently, consistent deterioration in achieved performance.

To help reconcile theory and practice, we present two modern perspectives for addressing BO's inner optimization problem that exploit key aspects of acquisition functions and their estimators. First, we clarify how sample path derivatives can be used to optimize a wide range of acquisition functions estimated via Monte Carlo (MC) integration.
Second, we identify a common family of submodular acquisition functions and show that its constituents can generally be expressed in a more computer-friendly form. These acquisition functions' properties enable greedy approaches to efficiently maximize them with guaranteed near-optimal results. Finally, we demonstrate through comprehensive experiments that these theoretical contributions directly translate to reliable and, often, substantial performance gains.

*Correspondence to j.wilson17@imperial.ac.uk

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Algorithm 1 BO outer-loop (joint parallelism)
1: Given model M, acquisition L, and data D
2: for t = 1, ..., T do
3:   Fit model M to current data D
4:   Set q = min(q_max, T − t)
5:   Find X ∈ argmax_{X′∈𝒳^q} L(X′)   ▷ inner optimization problem
6:   Evaluate y ← f(X)
7:   Update D ← D ∪ {(x_k, y_k)}_{k=1}^q
8: end for

Figure 1: (a) Pseudo-code for standard BO's "outer-loop" with parallelism q; the inner optimization problem is boxed in red. (b–c) GP-based belief and expected utility (EI), given four initial observations '•'. The aim of the inner optimization problem is to find the optimal query '⋆'. (d) Time to compute 2^14 evaluations of MC q-EI using a GP surrogate for varied observation counts and degrees of parallelism. Runtimes fall off at the final step because q decreases to accommodate evaluation budget T = 1,024.

2 Background

Bayesian optimization relies on both a surrogate model M and an acquisition function L to define a strategy for efficiently maximizing a black-box function f. At each "outer-loop" iteration (Figure 1a), this strategy is used to choose a set of queries X whose evaluation advances the search process.
This section reviews related concepts and closes with discussion of the associated inner optimization problem. For an in-depth review of BO, we defer to the recent survey [52].

Without loss of generality, we assume BO strategies evaluate q designs X ∈ ℝ^{q×d} in parallel so that setting q = 1 recovers purely sequential decision-making. We denote available information regarding f as D = {(x_i, y_i)}_{i=1}^n and, for notational convenience, assume noiseless observations y = f(X). Additionally, L has its own parameters (such as an improvement threshold), as does M (its hyperparameters ζ). Henceforth, direct reference to these terms will be omitted where possible.

Surrogate models  A surrogate model M provides a probabilistic interpretation of f whereby possible explanations for the function are seen as draws f^k ∼ p(f|D). In some cases, this belief is expressed as an explicit ensemble of sample functions [28, 54, 60]. More commonly however, M dictates the parameters θ of a (joint) distribution over the function's behavior at a finite set of points X. By first tuning the model's (hyper)parameters ζ to explain D, a belief is formed as p(y|X, D) = p(y; θ) with θ ← M(X; ζ). Throughout, θ ← M(X; ζ) is used to denote that belief p's parameters θ are specified by model M evaluated at X. A member of this latter category, the Gaussian process prior (GP) is the most widely used surrogate and induces a multivariate normal belief θ ≜ (μ, Σ) ← M(X; ζ) such that p(y; θ) = N(y; μ, Σ) for any finite set X (see Figure 1b).

Acquisition functions  With few exceptions, acquisition functions amount to integrals defined in terms of a belief p over the unknown outcomes y = {y_1, ..., y_q} revealed when evaluating a black-box function f at corresponding input locations X = {x_1, ..., x_q}.
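To make the belief θ ≜ (μ, Σ) ← M(X; ζ) concrete, the following is a minimal GP-posterior sketch in NumPy. It is an illustration only: it assumes a zero prior mean and a squared-exponential kernel (the experiments in this paper instead use a constant mean and a Matérn-5/2 kernel), and all names are ours.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.3, variance=1.0):
    # Squared-exponential kernel k(a, b) = variance * exp(-(a - b)^2 / (2 * lengthscale^2)).
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    # Belief parameters theta = (mu, Sigma) <- M(x_query) for a zero-mean GP,
    # conditioned on (nearly) noiseless observations; small jitter for stability.
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf_kernel(x_train, x_query)
    Kss = rbf_kernel(x_query, x_query)
    mu = Ks.T @ np.linalg.solve(K, y_train)
    Sigma = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mu, Sigma
```

At observed inputs the posterior mean reproduces the data and the variance collapses; far from the data, the belief reverts to the prior.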
This formulation naturally occurs as part of a Bayesian approach whereby the value of querying X is determined by accounting for the utility provided by possible outcomes y^k ∼ p(y|X, D). Denoting the chosen utility function as ℓ, this paradigm leads to acquisition functions defined as expectations

    L(X; D) = E_y[ℓ(y)] = ∫ ℓ(y) p(y|X, D) dy.    (1)

A seeming exception to this rule, non-myopic acquisition functions assign value by further considering how different realizations of D_q^k = D ∪ {(x_i, y_i^k)}_{i=1}^q impact our broader understanding of f and usually correspond to more complex, nested integrals. Figure 1c portrays a prototypical acquisition surface and Table 1 exemplifies popular, myopic and non-myopic instances of (1).

Inner optimization problem  Maximizing acquisition functions plays a crucial role in BO as the process through which abstract machinery (e.g. model M and acquisition function L) yields concrete actions (e.g. decisions regarding sets of queries X). Despite its importance however, this inner optimization problem is often neglected.
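For q = 1 and the EI utility ℓ(y) = ReLU(y − α), expectation (1) has a well-known closed form, which makes a convenient sanity check for MC estimation. A minimal sketch (helper names are ours; SciPy assumed):

```python
import numpy as np
from scipy.stats import norm

def ei_closed_form(mu, sigma, alpha):
    # Analytic E_y[ReLU(y - alpha)] for y ~ N(mu, sigma^2).
    z = (mu - alpha) / sigma
    return (mu - alpha) * norm.cdf(z) + sigma * norm.pdf(z)

def ei_monte_carlo(mu, sigma, alpha, m=200_000, seed=0):
    # MC estimate of (1): average the utility over reparameterized draws y = mu + sigma * z.
    rng = np.random.default_rng(seed)
    y = mu + sigma * rng.standard_normal(m)
    return np.maximum(y - alpha, 0.0).mean()
```

With a few hundred thousand samples the two estimates typically agree to within a few times 10⁻³.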
Abbr. | Acquisition Function L                                  | Reparameterization                                        | MM
EI    | E_y[max(ReLU(y − α))]                                   | E_z[max(ReLU(μ + Lz − α))]                                | Y
PI    | E_y[max(𝟙₋(y − α))]                                    | E_z[max(σ((μ + Lz − α)/τ))]                               | Y
SR    | E_y[max(y)]                                             | E_z[max(μ + Lz)]                                          | Y
UCB   | E_y[max(μ + √(βπ/2)|γ|)]                                | E_z[max(μ + √(βπ/2)|Lz|)]                                 | Y
ES    | E_{y_a}[H(E_{y_b|y_a}[𝟙₊(y_b − max(y_b))])]            | E_{z_a}[H(E_{z_b}[softmax((μ_{b|a} + L_{b|a} z_b)/τ)])]   | N
KG    | E_{y_a}[max(μ_b + Σ_{b,a}Σ_{a,a}⁻¹(y_a − μ_a))]         | E_{z_a}[max(μ_b + Σ_{b,a}Σ_{a,a}⁻¹ L_a z_a)]              | N

Table 1: Examples of reparameterizable acquisition functions; the final column indicates whether they belong to the MM family (Section 3.2). Glossary: 𝟙₊/𝟙₋ denote the right-/left-continuous Heaviside step functions; ReLU and σ rectified linear and sigmoid nonlinearities, respectively; H the Shannon entropy; α an improvement threshold; τ a temperature parameter; LL⊤ = Σ the Cholesky factorization; and residuals γ ∼ N(0, Σ). Lastly, non-myopic acquisition functions (ES and KG) are assumed to be defined using a discretization. Terms associated with the query set and discretization are respectively denoted via subscripts a and b.

This lack of emphasis is largely attributable to a greater focus on creating new and improved machinery as well as on applying BO to new types of problems. Moreover, elementary examples of BO facilitate L's maximization. For example, optimizing a single query x ∈ ℝ^d is usually straightforward when x is low-dimensional and L is myopic. Outside these textbook examples, however, BO's inner optimization problem becomes qualitatively more difficult to solve.
In virtually all cases, acquisition functions are non-convex (frequently due to the non-convexity of plausible explanations for f). Accordingly, increases in input dimensionality d can be prohibitive to efficient query optimization. In the generalized setting with parallelism q ≥ 1, this issue is exacerbated by the additional scaling in q. While this combination of non-convexity and (acquisition) dimensionality is problematic, the routine intractability of both non-myopic and parallel acquisition functions poses a commensurate challenge.

As is generally true of integrals, the majority of acquisition functions are intractable. Even Gaussian integrals, which are often preferred because they lead to analytic solutions for certain instances of (1), are only tractable in a handful of special cases [13, 18, 20]. To circumvent the lack of closed-form solutions, researchers have proposed a wealth of diverse methods. Approximation strategies [13, 15, 60], which replace a quantity of interest with a more readily computable one, work well in practice but may not converge to the true value. In contrast, bespoke solutions [10, 20, 22] provide (near-)analytic expressions but typically do not scale well with dimensionality.² Lastly, MC methods [27, 47, 53] are highly versatile and generally unbiased, but are often perceived as non-differentiable and, therefore, inefficient for purposes of maximizing L.

Regardless of the method, however, the (often drastic) increase in cost when evaluating L's proxy acts as a barrier to efficient query optimization, and these costs increase over time as shown in Figure 1d.
In an effort to address these problems, we now go inside the outer-loop and focus on efficient methods for maximizing acquisition functions.

3 Maximizing acquisition functions

This section presents the technical contributions of this paper, which can be broken down into two complementary topics: 1) gradient-based optimization of acquisition functions that are estimated via Monte Carlo integration, and 2) greedy maximization of "myopic maximal" acquisition functions. Below, we separately discuss each contribution along with its related literature.

3.1 Differentiating Monte Carlo acquisitions

Gradients are one of the most valuable sources of information for optimizing functions. In this section, we detail both the reasons and conditions whereby MC acquisition functions are differentiable and further show that most well-known examples readily satisfy these criteria (see Table 1).

²By near-analytic, we refer to cases where an expression contains terms that cannot be computed exactly but for which high-quality solvers exist (e.g. low-dimensional multivariate normal CDF estimators [20, 21]).

We assume that L is an expectation over a multivariate normal belief p(y|X, D) = N(y; μ, Σ) specified by a GP surrogate such that (μ, Σ) ← M(X). More generally, we assume that samples can be generated as y^k ∼ p(y|X, D) to form an unbiased MC estimator of an acquisition function L(X) ≈ L_m(X) ≜ (1/m) ∑_{k=1}^m ℓ(y^k). Given such an estimator, we are interested in verifying whether

    ∇L(X) ≈ ∇L_m(X) ≜ (1/m) ∑_{k=1}^m ∇ℓ(y^k),    (2)

where ∇ℓ denotes the gradient of utility function ℓ taken with respect to X.
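To illustrate estimator (2), the sketch below differentiates an MC q-EI estimate. For brevity it differentiates with respect to the posterior mean μ rather than through the full chain M(X), and it can be verified against central finite differences computed with common random numbers; all names are illustrative:

```python
import numpy as np

def qei_and_grad(mu, L, alpha, m=100_000, seed=0):
    # MC q-EI with utility l(y) = max(ReLU(y - alpha)) and reparameterized
    # draws y^k = mu + L z^k. For a fixed z^k, dl/dmu_i is an indicator on the
    # best improving coordinate, so the gradient estimate is the average of
    # these sample path derivatives, as in estimator (2).
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((m, len(mu)))
    gains = np.maximum(mu + z @ L.T - alpha, 0.0)
    value = gains.max(axis=1).mean()
    improving = gains.max(axis=1) > 0
    best = gains.argmax(axis=1)
    grad = np.array([np.mean(improving & (best == i)) for i in range(len(mu))])
    return value, grad
```

Because the same base samples z are reused across evaluations, a finite-difference check agrees with the sample path gradient to high accuracy.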
The validity of MC gradient estimator (2) is obscured by the fact that y^k depends on X through generative distribution p and that ∇L_m is the expectation of ℓ's derivative rather than the derivative of its expectation. Originally referred to as infinitesimal perturbation analysis [8, 24], the reparameterization trick [37, 50] is the process of differentiating through an MC estimate to its generative distribution p's parameters and consists of two components: i) reparameterizing samples from p as draws from a simpler base distribution p̂, and ii) interchanging differentiation and integration by taking the expectation over sample path derivatives.

Reparameterization  Reparameterization is a way of interpreting samples that makes their differentiability w.r.t. a generative distribution's parameters transparent. Often, samples y^k ∼ p(y; θ) can be re-expressed as a deterministic mapping φ: 𝒵 × Θ → 𝒴 of simpler random variates z^k ∼ p̂(z) [37, 50]. This change of variables helps clarify that, if ℓ is a differentiable function of y = φ(z; θ), then dℓ/dθ = (dℓ/dφ)(dφ/dθ) by the chain rule of (functional) derivatives.

If generative distribution p is multivariate normal with parameters θ = (μ, Σ), the corresponding mapping is then φ(z; θ) ≜ μ + Lz, where z ∼ N(0, I) and L is Σ's Cholesky factor such that LL⊤ = Σ. Rewriting (1) as a Gaussian integral and reparameterizing, we have

    L(X) = ∫_a^b ℓ(y) N(y; μ, Σ) dy = ∫_{a′}^{b′} ℓ(μ + Lz) N(z; 0, I) dz,    (3)

where each of the q terms c′_i in both a′ and b′ is transformed as c′_i = (c_i − μ_i − ∑_{j<i} L_{ij} z_j)/L_{ii}. The third column of Table 1 grounds (3) with several prominent examples. For a given draw y^k ∼ N(μ, Σ), the sample path derivative of ℓ w.r.t.
X is then

    ∇ℓ(y^k) = (dℓ(y^k)/dy^k)(dy^k/dM(X))(dM(X)/dX),    (4)

where, by minor abuse of notation, we have substituted in y^k = φ(z^k; M(X)). Reinterpreting y as a function of z therefore sheds light on individual MC samples' differentiability.

Interchangeability  Since L_m is an unbiased MC estimator consisting of differentiable terms, it is natural to wonder whether the average sample gradient ∇L_m (2) follows suit, i.e. whether

    ∇L(X) = ∇E_y[ℓ(y)] ?= E_y[∇ℓ(y)] ≈ ∇L_m(X),    (5)

where ?= denotes a potential equivalence when interchanging differentiation and expectation. Necessary and sufficient conditions for this interchange are that, as defined under p, integrand ℓ must be continuous and its first derivative ℓ′ must a.s. exist and be integrable [8, 24]. Wang et al. [59] demonstrated that these conditions are met for a GP with a twice differentiable kernel, provided that the elements in query set X are unique. The authors then use these results to prove that (2) is an unbiased gradient estimator for the parallel Expected Improvement (q-EI) acquisition function [10, 22, 53]. In later works, these findings were extended to include parallel versions of the Knowledge Gradient (KG) acquisition function [61, 62]. Figure 2d (bottom right) visualizes gradient-based optimization of MC q-EI for parallelism q = 2.

Extensions  Rather than focusing on individual examples, our goal is to show differentiability for a broad class of MC acquisition functions. In addition to its conceptual simplicity, one of MC integration's primary strengths is its generality. This versatility is evident in Table 1, which catalogs (differentiable) reparameterizations for six of the most popular acquisition functions. While some of these forms were previously known (EI and KG) or follow freely from the above (SR), others require additional steps. We summarize these steps below and provide full details in Appendix A.

Algorithm 2 BO outer-loop (greedy parallelism)
1: Given model M, acquisition L, and data D
2: for t = 1, ..., T do
3:   Fit model M to current data D
4:   Set X ← ∅
5:   for j = 1, ..., min(q_max, T − t) do   ▷ greedy parallel selection
6:     Find x_j ∈ argmax_{x∈𝒳} L(X ∪ {x})
7:     X ← X ∪ {x_j}
8:   end for
9:   Evaluate y ← f(X)
10:  Update D ← D ∪ {(x_j, y_j)}_{j=1}^q
11: end for

Figure 2: (a) Pseudo-code for BO outer-loop with greedy parallelism; the inner optimization problem is boxed in red. (b–c) Successive iterations of greedy maximization, starting from the posterior shown in Figure 1b. (d) On the left, greedily selected query '⋆'; on the right, trajectory from '×' to '⋆' when jointly optimizing parallel queries x1 and x2 via stochastic gradient ascent. Darker colors correspond with larger acquisitions.

In many cases of interest, utility is measured in terms of discrete events.
For example, Probability of Improvement [40, 58] is the expectation of a binary event e_PI: "will a new set of results improve upon a level α?" Similarly, Entropy Search [27] contains expectations of categorical events e_ES: "which of a set of random variables will be the largest?" Unfortunately, mappings from continuous variables y to discrete events e are typically discontinuous and, therefore, violate the conditions for (5). To overcome this issue, we utilize concrete (continuous to discrete) approximations in place of the original, discontinuous mappings [31, 41].

Still within the context of the reparameterization trick, [31, 41] studied the closely related problem of optimizing an expectation w.r.t. a discrete generative distribution's parameters. To do so, the authors propose relaxing the mapping from, e.g., uniform to categorical random variables with a continuous approximation so that the (now differentiable) transformed variables closely resemble their discrete counterparts in distribution. Here, we first map from uniform to Gaussian (rather than Gumbel) random variables, but the process is otherwise identical. Concretely, we can approximate PI's binary event as

    ẽ_PI(X; α, τ) = max(σ((y − α)/τ)) ≈ max(𝟙₋(y − α)),    (6)

where 𝟙₋ denotes the left-continuous Heaviside step function, σ the sigmoid nonlinearity, and τ ∈ [0, ∞) acts as a temperature parameter such that the approximation becomes exact as τ → 0. Appendix A.1 further discusses concrete approximations for both PI and ES.

Lastly, the Upper Confidence Bound (UCB) acquisition function [55] is typically not portrayed as an expectation, seemingly barring the use of MC methods. At the same time, the standard definition UCB(x; β) ≜ μ + β^{1/2}σ bears a striking resemblance to the reparameterization for normal random variables φ(z; μ, σ) = μ + σz. By exploiting this insight, it is possible to rewrite this closed-form expression as UCB(x; β) = 2∫_μ^∞ y N(y; μ, βπσ²/2) dy. Formulating UCB as an expectation allows us to naturally parallelize this acquisition function as

    UCB(X; β) = E_y[max(μ + √(βπ/2) |γ|)],    (7)

where |γ| = |y − μ| denotes the absolute value of y's residuals. In contrast with existing parallelizations of UCB [12, 15], Equation (7) directly generalizes its marginal form and can be efficiently estimated via MC integration (see Appendix A.2 for the full derivation).

These extensions further demonstrate how many of the apparent barriers to gradient-based optimization of MC acquisition functions can be overcome by borrowing ideas from new (and old) techniques.

3.2 Maximizing myopic maximal acquisitions

This section focuses exclusively on the family of myopic maximal (MM) acquisition functions: myopic acquisition functions defined as the expected max of a pointwise utility function ℓ̂, i.e. L(X) = E_y[ℓ(y)] = E_y[max ℓ̂(y)]. Of the acquisition functions included in Table 1, this family includes EI, PI, SR, and UCB. We show that these functions have special properties that make them particularly amenable to greedy maximization.

Greedy maximization is a popular approach for selecting near-optimal sets of queries X to be evaluated in parallel [1, 9, 12, 15, 35, 51].
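Reformulation (7) is easy to test numerically: for q = 1 it must recover the closed form μ + β^{1/2}σ, because E|z| = √(2/π) for z ∼ N(0, 1). A minimal MC sketch (names are ours):

```python
import numpy as np

def qucb_mc(mu, L, beta, m=200_000, seed=0):
    # MC estimate of (7): E_z[ max(mu + sqrt(beta * pi / 2) * |L z|) ],
    # where |L z| plays the role of the absolute residuals |gamma| = |y - mu|.
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((m, len(mu)))
    resid = np.abs(z @ L.T)
    return (mu + np.sqrt(beta * np.pi / 2.0) * resid).max(axis=1).mean()
```

For q > 1 the same estimator applies unchanged, with L the Cholesky factor of the joint covariance over the query set.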
This iterative strategy is so named because it always "greedily" chooses the query x that produces the largest immediate reward. At each step j = 1, ..., q, a greedy maximizer treats the j−1 preceding choices X_{<j} as constants and grows the set by selecting an additional element x_j ∈ argmax_{x∈𝒳} L(X_{<j} ∪ {x}; D) from the set of possible queries 𝒳. Algorithm 2 in Figure 2 outlines this process's role in BO's outer-loop.

Submodularity  Greedy maximization is often linked to the concept of submodularity (SM). Roughly speaking, a set function L is SM if its increase in value when adding any new point x_j to an existing collection X_{<j} is non-increasing in the cardinality of X_{<j} (for a technical overview, see [2]). Greedily maximizing SM functions is guaranteed to produce near-optimal results [39, 43, 46]. Specifically, if L is a normalized SM function with maximum L*, then a greedy maximizer will incur no more than (1/e) L* regret when attempting to solve for X* ∈ argmax_{X∈𝒳^q} L(X).

In the context of BO, SM has previously been appealed to when establishing outer-loop regret bounds [12, 15, 55]. Such applications of SM utilize this property by relating an idealized BO strategy to greedy maximization of a SM objective (e.g., the mutual information between black-box function f and observations D). In contrast, we show that the family of MM acquisition functions are inherently SM, thereby guaranteeing that greedy maximization thereof produces near-optimal choices X at each step of BO's outer-loop.³ We begin by removing some unnecessary complexity:

1. Let f^k ∼ p(f|D) denote the k-th possible explanation of black-box f given observations D. By marginalizing out nuisance variables f(𝒳 \ X), L can be expressed as an expectation over functions f^k themselves rather than over potential outcomes y^k ∼ p(y|X, D).
2. Belief p(f|D) and sample paths f^k depend solely on D.
Hence, expected utility L(X; D) = E_f[ℓ(f(X))] is a weighted sum over a fixed set of functions whose weights are constant. Since non-negative linear combinations of SM functions are SM [39], L(·) is SM so long as the same can be said of all functions ℓ(f^k(·)) = max ℓ̂_{f^k}(·).
3. As pointwise functions, f^k and ℓ̂ specify the set of values mapped to by 𝒳. They therefore influence whether we can normalize the utility function such that ℓ(∅) = 0, but do not impact SM. Appendix A.3 discusses the technical condition of normalization in greater detail. In general however, we require that v_min = min_{x∈𝒳} ℓ̂(f^k(x)) is guaranteed to be bounded from below for all functions under the support of p(f|D).

Having now eliminated confounding factors, the remaining question is whether max(·) is SM. Let V be the set of possible utility values and define max(∅) = v_min. Then, given sets A ⊆ B ⊆ V and ∀v ∈ V, it holds that

    max(A ∪ {v}) − max(A) ≥ max(B ∪ {v}) − max(B).    (8)

Proof: We prove the equivalent definition max(A) + max(B) ≥ max(A ∪ B) + max(A ∩ B). Without loss of generality, assume max(A ∪ B) = max(A). Then, max(B) ≥ max(A ∩ B) since, for any C ⊆ B, max(B) ≥ max(C). ∎

This result establishes the MM family as a class of SM set functions, providing strong theoretical justification for greedy approaches to solving BO's inner-optimization problem.

Incremental form  So far, we have discussed greedy maximizers that select a j-th new point x_j by optimizing the joint acquisition L(X_{1:j}; D) = E_{y_{1:j}|D}[ℓ(y_{1:j})] originally defined in (1). A closely related strategy [12, 15, 23, 53] is to formulate the greedy maximizer's objective as (the expectation of) a marginal acquisition function L̄. We refer to this category of acquisition functions, which explicitly represent the value of X_{1:j} as that of X_{<j} incremented by a marginal quantity, as incremental.
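The greedy inner loop of Algorithm 2 is straightforward to realize over a finite candidate set. The sketch below grows X one query at a time under an MC estimate of q-EI, reusing fixed base samples z so that successive comparisons are stable; the toy posterior and all names are illustrative:

```python
import numpy as np

def mc_qei(mu, cov, idx, alpha, z):
    # MC q-EI of the candidate subset idx under belief N(mu, cov), fixed base samples z.
    sub = np.ix_(idx, idx)
    Lc = np.linalg.cholesky(cov[sub] + 1e-9 * np.eye(len(idx)))
    y = mu[idx] + z[:, :len(idx)] @ Lc.T
    return np.maximum(y - alpha, 0.0).max(axis=1).mean()

def greedy_select(mu, cov, q, alpha, m=4000, seed=0):
    # Greedy maximization: at each step, add the candidate x maximizing L(X + {x}).
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((m, q))
    chosen = []
    for _ in range(q):
        rest = [i for i in range(len(mu)) if i not in chosen]
        scores = [mc_qei(mu, cov, chosen + [i], alpha, z) for i in rest]
        chosen.append(rest[int(np.argmax(scores))])
    return chosen
```

Because the marginal gain of a near-duplicate query is small, the greedy rule naturally favors diversity: after picking the highest-mean candidate, it prefers an independent one over a strongly correlated near-copy.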
The most common example of an incremental acquisition function is the iterated expectation E_{y_{<j}|D}[L̄(x_j; D_j)], where D_j = D ∪ {(x_i, y_i)}_{i<j} denotes a fantasy state. Because these integrals are generally intractable, MC integration (Section 3.1) is typically used to estimate their values by averaging over fantasies formed by sampling from p(y_{<j}|X_{<j}, D).

³An additional technical requirement for SM is that the ground set 𝒳 be finite. Under similar conditions, SM-based guarantees have been extended to infinite ground sets [55], but we have not yet taken these steps.

In practice, approaches based on incremental acquisition functions (such as the mentioned MC estimator) have several distinct advantages over joint ones. Marginal (myopic) acquisition functions usually admit differentiable, closed-form solutions. The latter property makes them cheap to evaluate, while the former reduces the sample variance of MC estimators. Moreover, these approaches can better utilize caching, since many computationally expensive terms (such as the Cholesky factor used to generate fantasies) only change between rounds of greedy maximization.

A joint acquisition function L can always be expressed as an incremental one by defining L̄ as the expectation of the corresponding utility function ℓ's discrete derivative

    Δ(x_j; X_{<j}, D) = E_{y_{1:j}|D}[δ(y_j; y_{<j})] = L(X_{1:j}; D) − L(X_{<j}; D),    (9)

with δ(y_j; y_{<j}) = ℓ(y_{1:j}) − ℓ(y_{<j}) and L(∅; D) = 0, so that L(X_{1:q}; D) = ∑_{j=1}^q Δ(x_j; X_{<j}, D). To show why this representation is especially useful for MM acquisition functions, we reuse the notation of (8) to introduce the following straightforward identity

    max(B) − max(A) = ReLU(max(B \ A) − max(A)).    (10)

Proof: Since v_min is defined as the smallest possible element of either set, the ReLU's argument is negative if and only if B's maximum is a member of A (in which case both sides equate to zero).
In\nall other cases, the ReLU can be eliminated and max(B) = max(B \\ A) by de\ufb01nition.\nReformulating the MM marginal gain function as (yj; y<j) = ReLU(`(yj)`(y<j)) now gives the\ndesired result: that the MM family\u2019s discrete derivative is the \u201cimprovement\u201d function. Accordingly,\nthe conditional expectation of (9) given fantasy state Dj is the expected improvement of `, i.e.\n\nEyj|Dj [(yj; y<j)] = EI` (xj;Dj) =Zj\n\n[`(yj)  `(y<j)] p(yj|xj,Dj)dyj,\n\n(11)\n\nwhere j , {yj : `(yj) >` (y<j)}. Since marginal gain function  primarily acts to lower bound\na univariate integral over yj, (11) often admits closed-form solutions. This statement is true of all\nMM acquisition functions considered here, making their incremental forms particularly ef\ufb01cient.\nPutting everything together, an MM acquisition function\u2019s joint and incremental forms equate as\nL(X1:q;D) =Pq\nj=1 Ey<j|D [EI` (xj;Dj))]. For the special case of Expected Improvement per se\n(denoted here as LEI to avoid confusion), this expression further simpli\ufb01es to reveal an exact equiv-\nalence whereby LEI(X1:q;D) = Pq\nj=1 Ey<j|D [LEI(xj;Dj)]. Appending B.3 compares perfor-\nmance when using joint and incremental forms, demonstrating how the latter becomes increasingly\nbene\ufb01cial as the dimensionality of the (joint) acquisition function q \u21e5 d grows.\n4 Experiments\n\nWe assessed the ef\ufb01cacy of gradient-based and submodular strategies for maximizing acquisition\nfunction in two primary settings: \u201csynthetic\u201d, where task f was drawn from a known GP prior, and\n\u201cblack-box\u201d, where f\u2019s nature is unknown to the optimizer. In both cases, we used a GP surrogate\nwith a constant mean and an anisotropic Mat\u00e9rn-5/2 kernel. 
For black-box tasks, ambiguity regarding the correct function prior was handled via online MAP estimation of the GP's (hyper)parameters. Appendix B.1 further details the setup used for synthetic tasks.

We present results averaged over 32 independent trials. Each trial began with three randomly chosen inputs, and competing methods were run from identical starting conditions. While the general notation of the paper has assumed noise-free observations, all experiments were run with Gaussian measurement noise, leading to observed values $\hat{y} \sim \mathcal{N}(f(x), 10^{-3})$.

Acquisition functions  We focused on parallel MC acquisition functions $\mathcal{L}_m$, particularly EI and UCB. Results using EI are shown here, and those using UCB are provided in extended results (Appendix B.3). To avoid confounding variables when assessing BO performance for different acquisition maximizers, results using the incremental form of q-EI discussed in Section 3.2 are also reserved for extended results.

Figure 3: Average performance of different acquisition maximizers on synthetic tasks from a known prior, given varied runtimes when maximizing Monte Carlo q-EI. Reported values indicate the log of the immediate regret $\log_{10}|f_{\max} - f(x^*)|$, where $x^*$ denotes the observed maximizer $x^* \in \arg\max_{x \in \mathcal{D}} \hat{y}$. [Rows show GP samples in $\mathbb{R}^4$, $\mathbb{R}^8$, and $\mathbb{R}^{16}$ ($d = q$); columns show inner budgets $N \in \{2^{12}, 2^{14}, 2^{16}\}$; curves compare joint and greedy variants of Random Search, CMA-ES, and stochastic gradient ascent.]

In additional experiments, we observed that optimization of PI and SR behaved like that of EI and UCB, respectively. However, overall performance using these acquisition functions was slightly worse, so further results are not reported here. Across experiments, the q-UCB acquisition function introduced in Section 3.1 outperformed q-EI on all tasks but the Levy function.

Generally speaking, MC estimators $\mathcal{L}_m$ come in both deterministic and stochastic varieties.
Here, determinism refers to whether or not each of the $m$ samples $y^k$ was generated using the same random variates $z^k$ within a given outer-loop iteration (see Section 3.1). Together with a decision regarding "batch size" $m$, this choice reflects a well-known tradeoff between approximation-, estimation-, and optimization-based sources of error when maximizing the true function $\mathcal{L}$ [6]. We explored this tradeoff for each maximizer and summarize our findings below.

Maximizers  We considered a range of (acquisition) maximizers, ultimately settling on stochastic gradient ascent (ADAM, [36]), Covariance Matrix Adaptation Evolution Strategy (CMA-ES, [26]), and Random Search (RS, [4]). Additional information regarding these choices is provided in Appendix B.1. For fair comparison, maximizers were constrained by CPU runtime. At each outer-loop iteration, an "inner budget" was defined as the average time taken to simultaneously evaluate $N$ acquisition values given equivalent conditions. When using greedy parallelism, this budget was split evenly among each of the $q$ iterations. To characterize performance as a function of allocated runtime, experiments were run using inner budgets $N \in \{2^{12}, 2^{14}, 2^{16}\}$.

For ADAM, we used stochastic minibatches consisting of $m = 128$ samples and an initial learning rate $\eta = 1/40$. To combat non-convexity, gradient ascent was run from a total of 32 (64) starting positions when greedily (jointly) maximizing $\mathcal{L}$. Appendix B.2 details the multi-start initialization strategy. As with the gradient-based approaches, CMA-ES performed better when run using stochastic minibatches ($m = 128$). Furthermore, reusing the aforementioned initialization strategy to generate CMA-ES's initial population of 64 samples led to additional performance gains.

Empirical results  Figures 3 and 4 present key results regarding BO performance under varying conditions.
Both sets of experiments explored an array of input dimensionalities $d$ and degrees of parallelism $q$ (shown in the lower left corner of each panel). Maximizers are grouped by color, with darker colors denoting use of greedy parallelism; inner budgets are shown in ascending order from left to right.

Results on synthetic tasks (Figure 3) provide a clearer picture of the maximizers' impact on the full BO loop by eliminating model mismatch. Across all dimensions $d$ (rows) and inner budgets $N$ (columns), gradient-based maximizers (orange) were consistently superior to both gradient-free (blue) and naïve (green) alternatives. Similarly, submodular maximizers generally surpassed their joint counterparts. However, in lower-dimensional cases where gradients alone suffice to optimize $\mathcal{L}_m$, the benefits of coupling gradient-based strategies with near-optima-seeking submodular maximization naturally decline. Lastly, the benefits of exploiting gradients and submodularity both scaled with increasing acquisition dimensionality $q \times d$.

Figure 4: Average performance of different acquisition maximizers on black-box tasks from an unknown prior, given varied runtimes when maximizing Monte Carlo q-EI. Reported values indicate the log of the immediate regret $\log_{10}|f_{\max} - f(x^*)|$, where $x^*$ denotes the observed maximizer $x^* \in \arg\max_{x \in \mathcal{D}} \hat{y}$. [Rows show Hartmann-6, Levy No. 3, and GP samples from an unknown prior; panel labels indicate input dimensionality $d$ and parallelism $q$; columns show inner budgets $N \in \{2^{12}, 2^{14}, 2^{16}\}$.]

Trends are largely identical for black-box tasks (Figure 4), and this commonality is most evident for tasks sampled from an unknown GP prior (final row). These runs were identical to those on synthetic tasks (specifically, the diagonal of Figure 3), but knowledge of $f$'s prior was withheld. Outcomes here clarify the impact of model mismatch, showing how maximizers maintain their influence. Finally, performance on Hartmann-6 (top row) serves as a clear indicator of the importance of thoroughly solving the inner optimization problem. In these experiments, performance improved despite mounting parallelism, due to a corresponding increase in the inner budget.

Overall, these results clearly demonstrate that both gradient-based and submodular approaches to (parallel) query optimization lead to reliable and, often, substantial improvements in outer-loop performance. Furthermore, these gains become more pronounced as the acquisition dimensionality increases. Viewed in isolation, maximizers utilizing gradients consistently outperform gradient-free alternatives.
Similarly, greedy strategies improve upon their joint counterparts in most cases.

5 Conclusion

BO relies upon an array of powerful tools, such as surrogate models and acquisition functions, and all of these tools are sharpened by strong usage practices. We extend these practices by demonstrating that Monte Carlo acquisition functions provide unbiased gradient estimates that can be exploited when optimizing them. Furthermore, we show that many of the same acquisition functions form a family of submodular set functions that can be efficiently optimized using greedy maximization. These insights serve as cornerstones for easy-to-use, general-purpose techniques for practical BO. Comprehensive empirical evidence shows that these techniques lead to substantial performance gains in real-world scenarios where queries must be chosen in finite time. By tackling the inner optimization problem, these advances directly benefit the theory and practice of Bayesian optimization.

Acknowledgments

The authors thank David Ginsbourger, Dario Azzimonti and Henry Wynn for initial discussions regarding the submodularity of various integrals. The support of the EPSRC Centre for Doctoral Training in High Performance Embedded and Distributed Systems (reference EP/L016796/1) is gratefully acknowledged. This work has partly been supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme under grant no.
716721.

References

[1] J. Azimi, A. Fern, and X.Z. Fern. Batch Bayesian optimization via simulation matching. In Advances in Neural Information Processing Systems, 2010.
[2] F. Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3), 2013.
[3] S. Bansal, R. Calandra, T. Xiao, S. Levine, and C.J. Tomlin. Goal-driven dynamics learning via Bayesian optimization. arXiv preprint arXiv:1703.09260, 2017.
[4] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 2012.
[5] S. Bochner. Lectures on Fourier Integrals. Number 42. Princeton University Press, 1959.
[6] O. Bousquet and L. Bottou. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, 2008.
[7] R. Calandra, A. Seyfarth, J. Peters, and M.P. Deisenroth. Bayesian optimization for learning gaits under uncertainty. Annals of Mathematics and Artificial Intelligence, 76(1-2), 2016.
[8] X. Cao. Convergence of parameter sensitivity estimates in a stochastic experiment. IEEE Transactions on Automatic Control, 30(9), 1985.
[9] Y. Chen and A. Krause. Near-optimal batch mode active learning and adaptive submodular optimization. In International Conference on Machine Learning, 2013.
[10] C. Chevalier and D. Ginsbourger. Fast computation of the multi-points expected improvement with applications in batch selection. In International Conference on Learning and Intelligent Optimization, 2013.
[11] R. Christian. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer Science & Business Media, 2007.
[12] E. Contal, D. Buffoni, A. Robicquet, and N. Vayatis. Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2013.
[13] J.P. Cunningham, P. Hennig, and S. Lacoste-Julien. Gaussian probabilities and expectation propagation. arXiv preprint arXiv:1111.6832, 2011.
[14] M.H. DeGroot. Optimal Statistical Decisions, volume 82. John Wiley & Sons, 2005.
[15] T. Desautels, A. Krause, and J.W. Burdick. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research, 2014.
[16] S. Falkner, A. Klein, and F. Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. In International Conference on Machine Learning, 2018.
[17] P.I. Frazier and J. Wang. Bayesian optimization for materials design. In Information Science for Materials Discovery and Design. 2016.
[18] H.I. Gassmann, I. Deák, and T. Szántai. Computing multivariate normal probabilities: A new look. Journal of Computational and Graphical Statistics, 11(4), 2002.
[19] M.A. Gelbart, J. Snoek, and R.P. Adams. Bayesian optimization with unknown constraints. arXiv preprint arXiv:1403.5607, 2014.
[20] A. Genz. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1992.
[21] A. Genz. Numerical computation of rectangular bivariate and trivariate normal and t probabilities. Statistics and Computing, 14(3), 2004.
[22] D. Ginsbourger, R. Le Riche, and L. Carraro. Kriging is well-suited to parallelize optimization, chapter 6. Springer, 2010.
[23] D. Ginsbourger, J. Janusevskis, and R. Le Riche. Dealing with asynchronicity in parallel Gaussian process based global optimization. In International Conference of the ERCIM WG on Computing & Statistics, 2011.
[24] P. Glasserman. Performance continuity and differentiability in Monte Carlo optimization. In Simulation Conference Proceedings, 1988 Winter. IEEE, 1988.
[25] I.S. Gradshteyn and I.M. Ryzhik. Table of Integrals, Series, and Products. Academic Press, 2014.
[26] N. Hansen. The CMA evolution strategy: A tutorial. arXiv preprint arXiv:1604.00772, 2016.
[27] P. Hennig and C. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 2012.
[28] J. Hernández-Lobato, M. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, 2014.
[29] F. Hutter, H.H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization. Springer, 2011.
[30] K. Jamieson and A. Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In Artificial Intelligence and Statistics, 2016.
[31] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
[32] D. Jones, M. Schonlau, and W. Welch. Efficient global optimization of expensive black box functions. Journal of Global Optimization, 13:455–492, 1998.
[33] D.R. Jones, C.D. Perttunen, and B.E. Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 1993.
[34] Z. Karnin, T. Koren, and O. Somekh. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, 2013.
[35] T. Kathuria, A. Deshpande, and P. Kohli. Batched Gaussian process bandit optimization via determinantal point processes. In Advances in Neural Information Processing Systems, 2016.
[36] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[37] D.P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
[38] S. Kotz and S. Nadarajah. Multivariate t-Distributions and Their Applications. Cambridge University Press, 2004.
[39] A. Krause and D. Golovin. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems. Cambridge University Press, 2014.
[40] H.J. Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1), 1964.
[41] C.J. Maddison, A. Mnih, and Y.W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
[42] R. Martinez-Cantin. BayesOpt: A Bayesian optimization library for nonlinear optimization, experimental design and bandits. Journal of Machine Learning Research, 15(1), 2014.
[43] M. Minoux. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization Techniques. 1978.
[44] J. Močkus. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference. Springer, 1975.
[45] J. Močkus. Application of Bayesian approach to numerical methods of global and stochastic optimization. Journal of Global Optimization, 4(4), 1994.
[46] G.L. Nemhauser, L.A. Wolsey, and M.L. Fisher. An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 14(1), 1978.
[47] M.A. Osborne, R. Garnett, and S.J. Roberts. Gaussian processes for global optimization. In International Conference on Learning and Intelligent Optimization, 2009.
[48] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, 2008.
[49] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[50] D.J. Rezende, M. Shakir, and D. Wierstra. Stochastic backpropagation and variational inference in deep latent Gaussian models. In International Conference on Machine Learning, 2014.
[51] A. Shah and Z. Ghahramani. Parallel predictive entropy search for batch global optimization of expensive objective functions. In Advances in Neural Information Processing Systems, 2015.
[52] B. Shahriari, K. Swersky, Z. Wang, R.P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, (1), 2016.
[53] J. Snoek, H. Larochelle, and R.P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, 2012.
[54] J.T. Springenberg, A. Klein, S. Falkner, and F. Hutter. Bayesian optimization with robust Bayesian neural networks. In Advances in Neural Information Processing Systems, 2016.
[55] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning, 2010.
[56] K. Swersky, J. Snoek, and R.P. Adams. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, 2013.
[57] T. Ueno, T.D. Rhone, Z. Hou, T. Mizoguchi, and K. Tsuda. COMBO: An efficient Bayesian optimization library for materials science. Materials Discovery, 4, 2016.
[58] F. Viana and R. Haftka. Surrogate-based optimization with parallel simulations using the probability of improvement. In AIAA/ISSMO Multidisciplinary Analysis Optimization Conference, 2010.
[59] J. Wang, S.C. Clark, E. Liu, and P.I. Frazier. Parallel Bayesian global optimization of expensive functions. arXiv preprint arXiv:1602.05149, 2016.
[60] Z. Wang and S. Jegelka. Max-value entropy search for efficient Bayesian optimization. In International Conference on Machine Learning, 2017.
[61] J. Wu and P.I. Frazier. The parallel knowledge gradient method for batch Bayesian optimization. In Advances in Neural Information Processing Systems, 2016.
[62] J. Wu, M. Poloczek, A.G. Wilson, and P.I. Frazier. Bayesian optimization with gradients. In Advances in Neural Information Processing Systems, pages 5267–5278, 2017.