{"title": "Bayesian Optimization with Unknown Search Space", "book": "Advances in Neural Information Processing Systems", "page_first": 11795, "page_last": 11804, "abstract": "Applying Bayesian optimization in problems wherein the search space is unknown is challenging. To address this problem, we propose a systematic volume expansion strategy for Bayesian optimization. We devise a strategy to guarantee that in iterative expansions of the search space, our method can find a point whose function value is within epsilon of the objective function maximum. Without the need to specify any parameters, our algorithm automatically triggers a minimal required expansion iteratively. We derive analytic expressions for when to trigger the expansion and by how much to expand. We also provide theoretical analysis to show that our method achieves epsilon-accuracy after a finite number of iterations. We demonstrate our method on both benchmark test functions and machine learning hyper-parameter tuning tasks and show that our method outperforms baselines.", "full_text": "Bayesian Optimization with Unknown Search Space

Huong Ha, Santu Rana, Sunil Gupta, Thanh Nguyen, Hung Tran-The, Svetha Venkatesh

{huong.ha, santu.rana, sunil.gupta, thanhnt, hung.tranthe, svetha.venkatesh}@deakin.edu.au

Applied Artificial Intelligence Institute (A2I2)

Deakin University, Geelong, Australia

Abstract

Applying Bayesian optimization in problems wherein the search space is unknown is challenging. To address this problem, we propose a systematic volume expansion strategy for Bayesian optimization. We devise a strategy to guarantee that in iterative expansions of the search space, our method can find a point whose function value is within ε of the objective function maximum. Without the need to specify any parameters, our algorithm automatically triggers a minimal required expansion iteratively. 
We derive analytic expressions for when to trigger the expansion and by how much to expand. We also provide theoretical analysis to show that our method achieves ε-accuracy after a finite number of iterations. We demonstrate our method on both benchmark test functions and machine learning hyper-parameter tuning tasks and show that our method outperforms baselines.

1 Introduction

Choosing where to search matters. A time-tested path in the quest for new products or processes is through experimental optimization. Bayesian optimization offers a sample-efficient strategy for experimental design by optimizing expensive black-box functions [9-11]. But one problem is that users need to specify a bounded region to restrict the search for the objective function extrema. When tackling a completely new problem, users do not have prior knowledge, hence there is no guarantee that an arbitrarily defined search space contains the global optimum. Thus the application of the Bayesian optimization framework when the search region is unknown remains an open challenge [16].
One approach is to use a regularized acquisition function such that its maximum can never be at infinity - hence no search space needs to be declared and an unconstrained optimizer can be used [16]. Other approaches use volume expansion, i.e. starting from the user-defined region, the search space is expanded during the optimization. The simplest strategy is to repeatedly double the volume of the search space every several iterations [16]. Nguyen et al. suggest a volume expansion strategy based on the evaluation budget [12]. All these methods require users to specify critical parameters - for example, regularization parameters [16], growth rate and expansion frequency (volume doubling) [16], or budget [12]. These parameters are difficult to specify in practice. 
Additionally, [12] is computationally expensive and the user-defined search space needs to be close to the global optimum.
In this paper, we propose a systematic volume expansion strategy for the Bayesian optimization framework wherein the search space is unknown. Without any prior knowledge about the objective function argmax or strict assumptions on the behavior of the objective function, it is impossible to guarantee global convergence when the search space is continuously expanded. To circumvent this problem, we consider the setting where we achieve the global ε-accuracy condition, that is, we aim to find a point whose function value is within ε of the objective function global maximum.
Our volume expansion strategy is based on two guiding principles: 1) the algorithm can reach a point whose function value is within ε of the objective function maximum in one expansion, and, 2) the search space should be minimally expanded so that the algorithm does not spend unnecessary evaluations near the search space boundary.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

As the objective function is unknown, it is not possible to compute this ideal expansion region. Using the GP-UCB acquisition function as a surrogate, this region is computed as one that contains at least one point whose acquisition function value is within ε of the acquisition function maximum. However, by using a surrogate to approximate the objective function, there is no guarantee that we can achieve the global ε-accuracy within one expansion. Hence multiple expansions are required, and a new expansion is triggered when the local ε-accuracy is satisfied, i.e. when the algorithm can find a point whose function value is within ε of the objective function maximum in the current search space. 
Analytical expressions for the size of the new expansion space and when to trigger the expansion are derived. The guarantees for the ε-accuracy condition, however, now lapse in the expanded region, and so we adjust the acquisition function appropriately to maintain the guarantee. Finally, we provide theoretical analysis to show that our proposed method achieves the global ε-accuracy condition after a finite number of iterations. We demonstrate our algorithm on five synthetic benchmark functions and three real hyperparameter tuning tasks for common machine learning models: linear regression with elastic net, multilayer perceptron and convolutional neural network. Our experimental results show that our method achieves better function values with fewer samples compared to state-of-the-art approaches. In summary, our contributions are:

• Formalising the analysis for the Bayesian optimization framework in an unknown search space setting, and introducing ε-accuracy as a way to track the algorithmic performance;
• Providing analytic expressions for how far to expand the search space and when to expand the search space to achieve global ε-accuracy;
• Deriving theoretical global ε-accuracy convergence; and,
• Demonstrating our algorithm on both synthetic and real-world problems and comparing it against state-of-the-art methods.

Our method differs from previous works in that 1) our method does not require any algorithmic parameters, automatically adjusting both when to trigger the expansion and by how much to expand, and, 2) our approach is the only one to guarantee the global ε-accuracy condition. This is because we guarantee the local ε-accuracy condition in each search space, thus eventually the global ε-accuracy is achieved. Without this local guarantee, the suggested solution cannot be guaranteed to reach global ε-accuracy. 
The regularization [16] and the filtering method [12] require the global optimum to be within a bound constructed by either the user-specified regularizer or the budget. The volume doubling method [16] can continue to expand the search space to infinity; however, the local ε-accuracy condition is not guaranteed in each search space.
The paper is organized as follows. Section 2 gives an overview of Bayesian optimization and discusses some of the related work. Section 3 describes the problem setup. Section 4 proposes our new expansion strategy for the Bayesian optimization framework when the search space is unknown. A theoretical analysis of our proposed method is presented in Section 5. In Section 6, we demonstrate the effectiveness of our algorithm by numerical experiments. Finally, Section 7 concludes the paper.

2 Background and Related Work

2.1 Background

Bayesian optimization is a powerful optimization method to find the global optimum of an unknown objective function f(x) by sequential queries [9-11, 17, 18]. First, at time t, a surrogate model is used to approximate the behaviour of f(x) using all the currently observed data D_{t-1} = {(x_i, y_i)}_{i=1}^{n}, y_i = f(x_i) + ξ_i, where ξ_i ~ N(0, σ²) is the noise. Second, an acquisition function is constructed from the surrogate model that suggests the next point x_{itr} to be evaluated. The objective function is then evaluated at x_{itr} and the new data point (x_{itr}, y_{itr}) is added to D_{t-1}. These steps are conducted in an iterative manner to get the best estimate of the global optimum.
The most common choice for the surrogate model used in Bayesian optimization is the Gaussian Process (GP) [14]. 
Assume the function f follows a GP with mean function m₀(x) and covariance function k(x, x′); the posterior distribution of f given the observed data D_{t-1} = {(x_i, y_i)}_{i=1}^{n} is a GP with the following posterior mean and variance,

μ_{t-1}(x) = m₀(x) + k_{|D_{t-1}|}(x)ᵀ (K_{|D_{t-1}|} + σ² I_{|D_{t-1}|})⁻¹ y_{|D_{t-1}|},
σ²_{t-1}(x) = k(x, x) − k_{|D_{t-1}|}(x)ᵀ (K_{|D_{t-1}|} + σ² I_{|D_{t-1}|})⁻¹ k_{|D_{t-1}|}(x),    (1)

where y_{|D_{t-1}|} = [y₁, ..., y_{|D_{t-1}|}]ᵀ, k_{|D_{t-1}|}(x) = [k(x, x_i)]_{i=1}^{|D_{t-1}|}, K_{|D_{t-1}|} = [k(x_i, x_j)]_{i,j}, I_{|D_{t-1}|} is the |D_{t-1}| × |D_{t-1}| identity matrix and |D_{t-1}| denotes the cardinality of D_{t-1}. To aid readability, in the sequel we remove the notation that shows the dependence of k, K, I, y on |D_{t-1}|.
There are many existing acquisition functions [6, 7, 10, 11, 20] and in this paper, we focus only on the GP-UCB acquisition function [1, 2, 5, 19]. The GP-UCB acquisition function is defined as,

α_UCB(x; D_{t-1}) = μ_{t-1}(x) + √β_t σ_{t-1}(x),    (2)

where μ_{t-1}(x), σ_{t-1}(x) are the posterior mean and standard deviation of the GP given observed data D_{t-1} and β_t ≥ 0 is an appropriate parameter that balances exploration and exploitation. Given a search domain, {β_t} can be chosen as in [19] to ensure global convergence in this domain.

2.2 Related Work

All the work related to the problem of Bayesian optimization with an unknown search space has been described in Section 1. The work in [3] also introduces the term ε-accuracy. 
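For concreteness, the posterior of Eq. (1) and the GP-UCB rule of Eq. (2) can be sketched in plain NumPy. This is a minimal sketch assuming a zero prior mean (Assumption 4.1) and a Squared Exponential kernel; the values θ = 1, l = 0.5, σ = 0.1 and β = 4 are illustrative, not from the paper:

```python
import numpy as np

def se_kernel(A, B, theta=1.0, ls=0.5):
    """Squared Exponential kernel k(x, x') = theta^2 exp(-||x - x'||^2 / (2 l^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return theta**2 * np.exp(-d2 / (2 * ls**2))

def gp_posterior(X, y, Xq, sigma=0.1, theta=1.0, ls=0.5):
    """Posterior mean/variance of Eq. (1) at query points Xq, zero prior mean."""
    K = se_kernel(X, X, theta, ls) + sigma**2 * np.eye(len(X))
    Kinv = np.linalg.inv(K)
    kq = se_kernel(X, Xq, theta, ls)                      # |D| x nq cross-covariances
    mu = kq.T @ Kinv @ y
    var = theta**2 - np.einsum('ij,ik,kj->j', kq, Kinv, kq)
    return mu, np.maximum(var, 0.0)                       # clip tiny negative round-off

def gp_ucb(X, y, Xq, beta=4.0, **kw):
    """GP-UCB acquisition of Eq. (2): mu + sqrt(beta) * sigma_posterior."""
    mu, var = gp_posterior(X, y, Xq, **kw)
    return mu + np.sqrt(beta) * np.sqrt(var)
```

Far from all observations the cross-covariances vanish, so the acquisition value tends to √β·θ, which is exactly the limit used later in Proposition 4.1.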
However, their purpose is to unify the Bayesian optimization and the level-set estimation frameworks.

3 Problem Setup

We wish to find the global argmax x_max of an unknown objective function f : R^d → R, whose argmax is at a finite location, i.e.

x_max = argmax_{x∈S*} f(x),    (3)

where S* is a finite region that contains the argmax of the function f(x). In practice, the region S* is not known in advance, so users need to identify a search domain S_user which is likely to contain the argmax of f(x). This search domain can be set arbitrarily or based on limited prior knowledge. Thus there is no guarantee that S_user contains the global optimum of the objective function. In the trivial cases when the search space S* is known or when S* ⊂ S_user, global convergence can be guaranteed through classical analysis [4, 19]. Here, we consider the general case when S* may or may not be a subset of S_user. Without any prior knowledge about S* or strict assumptions on the behavior of the objective function, it is impossible to guarantee global convergence. Therefore, in this work, instead of solving Eq. (3), we consider the setting where we achieve the global ε-accuracy condition. That is, for a small positive value ε, we find a solution x_ε which satisfies,

f(x_max) − f(x_ε) ≤ ε.    (4)

4 Proposed Approach

We make some mild assumptions to develop our main results.
Assumption 4.1 The prior mean function m₀(x) = 0.
This is done by subtracting the mean from all observations and is common practice.
Assumption 4.2 The kernel k(x, x′) satisfies: (1) k(x, x′) → 0 when ‖x − x′‖₂ → +∞; (2) k(x, x′) ≤ 1 ∀(x, x′); (3) k(x, x) = θ², where θ ≥ 0 is the scale factor of the kernel function.
Various kernels satisfy Assumption 4.2, e.g. the Matérn kernel and the Squared Exponential kernel. As the function can always be re-scaled, condition (2) is met without loss of generality [15, 19].
Defining g_k(γ): With these types of kernels, for all small positive γ, there always exists g_k(γ) > 0 such that

∀x, x′ : ‖x − x′‖₂ ≥ g_k(γ) ⟹ k(x, x′) ≤ γ.    (5)

The value of g_k(γ) can be computed from γ and the kernel covariance function k(x, x′); e.g. for the Squared Exponential kernel k_SE(x, x′) = θ² exp(−‖x − x′‖₂²/(2l²)), g_k(γ) will be √(2l² log(θ²/γ)).
Assumption 4.3 The kernel k(x, x′) is known in advance or can be learned from the observations.

Figure 1: Expanded region (blue), when the GP-UCB acquisition function argmax is at (1) infinity; or (2) at a finite location and its function value is larger than or equal to √β_t θ + ε/2; or (3) at a finite location and its function value is smaller than √β_t θ + ε/2.

4.1 Proposed Expansion Strategy

The ideal expansion strategy should satisfy two characteristics: 1) the algorithm can reach the global ε-accuracy condition in one expansion, and, 2) the search space should be minimally expanded so that the algorithm does not spend unnecessary evaluations near the search space boundary. Since we have a black-box objective function, it is not possible to compute the ideal expansion space S_ideal directly. Let the exploration-exploitation parameters {β_t} be chosen to ensure the objective function is upper bounded by the GP-UCB acquisition function with high probability. Then we can estimate S_ideal by a region S, as a minimal region that contains at least one point whose acquisition function value is within ε of the acquisition function maximum, i.e. 
∃x_u ∈ S : |α_UCB(x_u; D_{t-1}) − max_{x∈R^d} α_UCB(x; D_{t-1})| ≤ ε. Due to this approximation, there is no guarantee that we can achieve the global ε-accuracy in one expansion. Thus we need multiple sequential expansions. A new expansion is triggered when the local ε-accuracy is satisfied in the previous expansion. In the following, we first derive the value of the GP-UCB acquisition function when x → ∞ (Proposition 4.1), and then use this value to derive analytical expressions for the size of the expansion space S (Theorem 4.1) and when to trigger a new expansion.

Proposition 4.1 When x → ∞, the GP-UCB acquisition function α_UCB(x; D_{t-1}) → √β_t θ, where β_t is the exploration-exploitation parameter of the GP-UCB acquisition function and θ is the scale factor of the kernel function k(x, x′).

Derivation of the expansion search space Our idea is to choose the region S such that S = R^d \ A, where 1) A contains all the points x that are far from all the current observations, and, 2) A := {x ∈ R^d : |α_UCB(x; D_{t-1}) − √β_t θ| < ε/2}. Here, we will show that with this choice of S, there exists at least one point in S whose acquisition function value is within ε of the acquisition function maximum, given ε < |√β_t θ − min_{x∈R^d}(α_UCB(x; D_{t-1}))|. We consider three cases that can happen to the GP-UCB acquisition function (see Figure 1):

• Case 1: The argmax of the GP-UCB acquisition function is at infinity. This means that the GP-UCB acquisition function maximum is equal to √β_t θ. As the GP-UCB acquisition function is continuous and ε < |√β_t θ − min_{x∈R^d}(α_UCB(x; D_{t-1}))|, there exists a point x_u such that α_UCB(x_u) = √β_t θ − ε/2. By the definition of S, it is straightforward that x_u belongs to S, thus proving that there exists a point in S whose GP-UCB acquisition function value is within ε of the maximum of the acquisition function.

• Case 2: The argmax x′_max of the GP-UCB acquisition function is at a finite location and its acquisition function value is larger than or equal to √β_t θ + ε/2. It is straightforward to see that the argmax x′_max belongs to the region S and this is the point that satisfies |α_UCB(x′_max; D_{t-1}) − max_{x∈R^d} α_UCB(x; D_{t-1})| ≤ ε.

• Case 3: The GP-UCB acquisition function argmax is at a finite location and the acquisition function maximum is smaller than √β_t θ + ε/2. As the GP-UCB acquisition function is continuous and ε < |√β_t θ − min_{x∈R^d}(α_UCB(x; D_{t-1}))|, there exists a point x_u ∈ S : α_UCB(x_u; D_{t-1}) = √β_t θ − ε/2. As max_{x∈R^d} α_UCB(x; D_{t-1}) < √β_t θ + ε/2, it follows directly that |α_UCB(x_u; D_{t-1}) − max_{x∈R^d} α_UCB(x; D_{t-1})| ≤ ε.

Theorem 4.1 now formally derives an analytical expression for one way to define the region S.

Algorithm 1 Bayesian optimization with unknown search space (GPUCB-UBO)
1: Input: Gaussian Process (GP) M, acquisition functions α_UCB, α_LCB, initial observations D_init, initial search space S_user, function f, positive small threshold ε, evaluation budget T.
2: Output: Point x_ε : max f(x) − f(x_ε) ≤ ε.
3: Initialize D_0 = D_init, S = S_user, β_1, t_k = 0. Update the GP using D_0.
4: for t = 1, 2, ..., T do
5:   Set t_local = t − t_k
6:   Compute x_m = argmax_{x∈S} α_UCB(x; D_{t-1})
7:   Set x_t = x_m, y_t = f(x_t). 
Update D_t = D_{t-1} ∪ (x_t, y_t).
8:   /* Compute the expansion trigger, the regret upper bound */
9:   Compute r_b = α_UCB(x_t; D_{t-1}) − max_{x∈D_t} α_LCB(x; D_{t-1}) + 1/t_local²
10:  /* If expansion triggered, expand the search space */
11:  if (r_b ≤ ε) | (t == 1) then
12:     Compute the new search space S as defined in Theorem 4.1
13:     Set t_k = t_k + t_local
14:  end if
15:  /* Adjust the β_t based on the search space */
16:  Compute β_t following Theorem 5.1
17:  Update the GP using D_t.
18: end for

Theorem 4.1 Consider the GP-UCB acquisition function α_UCB(x; D_{t-1}). Let us define the region S = ∪_{i=1}^{|D_{t-1}|} S_i, S_i = {x : ‖x − x_i‖₂ ≤ d_ε}, x_i ∈ D_{t-1}, where |D_{t-1}| is the cardinality of D_{t-1},

d_ε = g_k(min(√(√β_t θ ε/2 − ε²/16)/(|D_{t-1}| λ_max √β_t), 0.25ε/max(Σ_{z_j≤0} −z_j, Σ_{z_j≥0} z_j))),

with g_k(.) as in Eq. (5), λ_max the largest singular value of (K + σ²I)⁻¹, and z_j the jth element of (K + σ²I)⁻¹y. Given ε < |√β_t θ − min_{x∈R^d}(α_UCB(x; D_{t-1}))|, then there exists at least one point in S whose acquisition function value is within ε of the acquisition function maximum, i.e. ∃x_u ∈ S : |α_UCB(x_u; D_{t-1}) − max_{x∈R^d} α_UCB(x; D_{t-1})| ≤ ε.

Acquisition function adaptation Let us denote S_k as the kth expansion search space (k ≥ 1). In each S_k, the parameter {β_t} of the GP-UCB acquisition function needs to be valid to ensure the algorithm achieves the local ε-accuracy condition. Hence, a new {β_t} is adjusted after each expansion. 
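A minimal sketch of how the expansion region of Theorem 4.1 might be computed for the Squared Exponential kernel. The grouping of terms inside d_ε follows our reading of the theorem and should be treated as an assumption, as are the illustrative kernel values (θ = 1, l = 0.5, σ = 0.1); the returned box is the axis-aligned region that encloses the union of balls around the observations:

```python
import numpy as np

def g_k_se(gamma, theta=1.0, ls=0.5):
    """g_k for the Squared Exponential kernel: ||x - x'|| >= g_k(gamma) => k <= gamma."""
    return np.sqrt(2 * ls**2 * np.log(theta**2 / gamma))

def expansion_box(X, y, beta, eps, sigma=0.1, theta=1.0, ls=0.5):
    """Radius d_eps of Theorem 4.1 (as we read it) and an encompassing box."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = theta**2 * np.exp(-d2 / (2 * ls**2))
    Ainv = np.linalg.inv(K + sigma**2 * np.eye(len(X)))
    lam_max = np.linalg.svd(Ainv, compute_uv=False)[0]    # largest singular value
    z = Ainv @ y                                          # z_j of the theorem
    g1 = np.sqrt(np.sqrt(beta) * theta * eps / 2 - eps**2 / 16) \
         / (len(X) * lam_max * np.sqrt(beta))
    g2 = 0.25 * eps / max(-z[z <= 0].sum(), z[z >= 0].sum())
    d_eps = g_k_se(min(g1, g2), theta, ls)
    # Box enclosing the union of balls of radius d_eps around the observations.
    return X.min(0) - d_eps, X.max(0) + d_eps
```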
Details on how to compute the new {β_t} are given in Theorem 5.1.
Triggering the next expansion To guarantee the global ε-accuracy condition, in each search space S_k, we aim to find an iteration T_k which satisfies r_{S_k}(T_k) = (max_{x∈S_k} f(x) − max_{x_i∈D_{T_k}} f(x_i)) ≤ ε before the next expansion. As we do not have max_{x∈S_k} f(x) and {f(x_i)}, we bound r_{S_k}(t) by r_{b,S_k}(t) = max_{x∈S_k} α_UCB(x; D_{t-1}) + 1/t² − max_{x∈D_t} α_LCB(x; D_{t-1}), where α_LCB(x; D_{t-1}) = μ_{t-1}(x) − √β_t σ_{t-1}(x). The next expansion is triggered when r_{b,S_k}(t) reaches ε.
Search space optimization The theoretical search space developed in Theorem 4.1 is the union of |D_{t-1}| balls. To suit optimizer input, this region is converted to an encompassing hypercube using,

min_{x_i∈D_{t-1}}(x_i^k) − d_ε ≤ x^k ≤ max_{x_i∈D_{t-1}}(x_i^k) + d_ε,  k = 1, ..., d.    (6)

Further refinement of the implementation is provided in the supplementary material.
Algorithm 1 describes the proposed Bayesian optimization with unknown search space algorithm.

5 Theoretical Analysis

First, to ensure the validity of our algorithm, we prove that for a wide range of kernels, for any search space S_k and any positive ε, with a proper choice of {β_t}, our expansion trigger condition occurs with high probability. 
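The expansion trigger r_{b,S_k}(t) described in Section 4 can be sketched as follows; the grid maximization stands in for a proper optimizer over S_k, and the kernel values (θ = 1, l = 0.5, σ = 0.1) are illustrative assumptions:

```python
import numpy as np

def ucb_lcb(X, y, Xq, beta=4.0, sigma=0.1, theta=1.0, ls=0.5):
    """GP-UCB / GP-LCB: mu +/- sqrt(beta) * posterior std, zero prior mean."""
    d2 = lambda A, B: ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    k = lambda A, B: theta**2 * np.exp(-d2(A, B) / (2 * ls**2))
    Ainv = np.linalg.inv(k(X, X) + sigma**2 * np.eye(len(X)))
    kq = k(X, Xq)
    mu = kq.T @ Ainv @ y
    sd = np.sqrt(np.maximum(theta**2 - np.einsum('ij,ik,kj->j', kq, Ainv, kq), 0))
    return mu + np.sqrt(beta) * sd, mu - np.sqrt(beta) * sd

def expansion_triggered(X, y, grid, t_local, eps, beta=4.0):
    """r_b = max_{S_k} UCB + 1/t^2 - max_{x in D_t} LCB; expand when r_b <= eps."""
    ucb_S, _ = ucb_lcb(X, y, grid, beta)   # UCB over (a grid approximating) S_k
    _, lcb_D = ucb_lcb(X, y, X, beta)      # LCB at the observed points
    r_b = ucb_S.max() + 1.0 / t_local**2 - lcb_D.max()
    return r_b <= eps, r_b
```

Early on, the posterior is uncertain over most of S_k, so r_b stays well above ε and no expansion is triggered; r_b shrinks as observations accumulate.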
When this happens, the algorithm achieves the local ε-accuracy condition.

Proposition 5.1 For any d-dimensional domain S_k with side length r_k, for the kernel classes: finite dimensional linear, Squared Exponential and Matérn, suppose the kernel k(x, x′) satisfies the following condition on the derivatives of GP sample paths f: ∃a_k, b_k > 0 such that Pr{sup_{x∈S_k} |∂f/∂x_j| > L} ≤ a_k exp(−(L/b_k)²), j = 1, ..., d. Pick δ ∈ (0, 1), and define β_t = 2 log(t² 2π²/(3δ)) + 2d log(t² d b_k r_k √(log(4d a_k/δ))). Then ∀ε > 0, with probability larger than 1 − δ, there ∃T_k : ∀t ≥ T_k, max_{x∈S_k} α_UCB(x; D_{t-1}) − max_{x∈D_t} α_LCB(x; D_{t-1}) ≤ ε − 1/t²; and ∀t that satisfies the previous condition, max_{x∈S_k} f(x) − max_{x∈D_t} f(x) ≤ ε.

Second, we prove that with a proper choice of {β_t} and for a wide class of kernels, after a finite number of iterations, our algorithm achieves the global ε-accuracy condition with high probability.

Theorem 5.1 Denote {S_k} as the series of expansion search spaces suggested by our algorithm (k ≥ 1). In each S_k, let T_k be the smallest number of iterations that satisfies our expansion trigger condition, i.e. r_{b,S_k}(T_k) ≤ ε. Suppose the kernel k(x, x′) belongs to the kernel classes listed in Proposition 5.1 and it satisfies the following condition on the derivatives of GP sample paths f: ∃a_k, b_k > 0 such that Pr{sup_{x∈S_k} |∂f/∂x_j| > L} ≤ a_k exp(−(L/b_k)²), j = 1, ..., d. Pick δ ∈ (0, 1), and define β_t = 2 log((t − Σ_{j≤k-1} T_j)² 2π²/(3δ)) + 2d log((t − Σ_{j≤k-1} T_j)² d b_k r_k √(log(4d a_k/δ))), for Σ_{j≤k-1} T_j + 1 ≤ t ≤ Σ_{j≤k} T_j, k = 1, 2, .... Then running the proposed algorithm with the above choice of β_t for a sample f of a GP with mean function zero and covariance function k(x, x′), after a finite number of iterations, we achieve global ε-accuracy with at least 1 − δ probability, i.e.

Pr{f(x_max) − f(x_suggest) ≤ ε} ≥ 1 − δ,

where x_suggest is the algorithm recommendation and x_max is the objective function global argmax.

Discussion The difference between our method and previous works is that we guarantee the local ε-accuracy condition in each search space, eventually achieving the global ε-accuracy. Previous methods do not give this guarantee, and thus their final solution may not reach global ε-accuracy.

6 Experimental Evaluation

We evaluate our method on five synthetic benchmark functions and three hyperparameter tuning tasks for common machine learning models. For problems with dimension d, the optimization evaluation budget is 10d (excluding the initial 3d points sampled following a Latin hypercube design [8]). The experiments were repeated 30 and 20 times for the synthetic functions and machine learning hyperparameter tuning tasks respectively. For all algorithms, the Squared Exponential kernel is used, the GP models are fitted using the Maximum Likelihood method and the output observations {y_i} are normalized y_i ~ N(0, 1). As with previous GP-based algorithms that use confidence bounds [3, 19], our theoretical choice of {β_t} in Theorem 5.1 is typically overly conservative. Hence, following the suggestion in [19], for any algorithm that uses the GP-UCB acquisition, we scale β_t down by a factor of 5. 
Finally, for the synthetic functions, ε is set at 0.05 whilst for the machine learning models, ε is set at 0.02, as we require higher accuracy in these cases.
We compare our proposed method, GPUCB-UBO, with seven baselines: (1) EI-Vanilla: the vanilla Expected Improvement (EI); (2) EI-Volx2: the EI with the search space volume doubled every 3d iterations [16]; (3) EI-H: the Regularized EI with a hinge-quadratic prior mean where β = 1 and R is the circumradius of the initial search space [16]; (4) EI-Q: the Regularized EI with a quadratic prior mean where the widths w are set to those of the initial search space [16]; (5) GPUCB-Vanilla: the vanilla GP-UCB; (6) GPUCB-Volx2: the GP-UCB with the search space volume doubled every 3d iterations [16]; (7) GPUCB-FBO: the GP-UCB with the filtering expansion strategy in [12].

6.1 Visualization

We visualize our theoretical expansion search spaces derived in Theorem 4.1 on the Beale test function (Figure 2). We show the contour plots of the GP-UCB acquisition functions, and show both the observations (red stars) and the recommendation from the algorithm that corresponds to the acquisition function maximum (cyan stars). The initial user-defined search space (black rectangle) is expanded as per the theoretical search spaces developed in Theorem 4.1 (yellow rectangles). Here we use Eq. (6) to plot the expansion search spaces; however, the spaces developed in Theorem 4.1 are tighter. The figure illustrates that when the argmax of the objective function is outside of the user-defined search space, with our search space expansion strategy, this argmax can be located within a finite number of expansions.

Figure 2: Expansion search spaces using Theorem 4.1 for the Beale function in two cases when the global ε-accuracy is achieved within (a) one expansion; or (b) two expansions. 
The black rectangle is the user-defined search space and the yellow rectangles are the theoretical expansion search spaces. The contour plots of the acquisition function are also displayed with observations (red stars) and the recommendation at that iteration (cyan star). The global optimum of the Beale function is the magenta star.

6.2 Synthetic Benchmarks

We compare our method with seven baselines on five benchmark test functions: Beale, Eggholder, Levy 3, Hartman 3 and Hartman 6. We use the same experimental setup as in [16]. The length of the initial user-defined search space is set to be 20% of the length of the function domain - e.g. if the function domain is the unit hypercube [0, 1]^d, then the initial search space has side length 0.2. The center of this initial search space is placed randomly in the domain of the objective function.
For each test function and algorithm, we run the experiment 30 times, and each time the initial search space is placed differently. We plot the mean and the standard error of the best found values max_{i=1,n} f(x_i) for each test function. Figure 3 shows that for most test functions, our method GPUCB-UBO achieves both better function values and in fewer iterations than other methods. For most test functions, our method is better than the other six state-of-the-art approaches (except GPUCB-FBO) by a high margin. Compared with GPUCB-FBO, our method is better on the test functions Hartman3 and Hartman6 while performing similarly on the other three test functions. Note that the computation time of GPUCB-FBO is 2-3 times slower than our method and other approaches (see Table 1) because it needs an extra step to numerically solve several optimization problems to construct the new search space. 
Since we derive the expansion search spaces analytically, our method, in contrast, can optimize the acquisition function within these spaces without any additional computation.

Figure 3: Best found values of various synthetic benchmark test functions using different algorithms. Plotting mean and standard error over 30 repetitions. (Best seen in color)

Table 1: The average runtime (seconds) of selecting the next input for different methods. All the time measurements were taken when evaluating the methods on an Ubuntu 18.04.2 server with Intel Xeon CPU E5-2670 2.60GHz 128GB RAM. All the source code is written in Python 3.6.

METHODS      | Beale      | Eggholder  | Hartman3   | Levy3      | Hartman6
GPUCB-UBO    | 2.8 ± 0.2  | 2.8 ± 0.3  | 3.1 ± 0.5  | 3.7 ± 0.5  | 5.0 ± 0.9
EIH          | 3.4 ± 0.2  | 1.2 ± 0.03 | 1.0 ± 0.01 | 4.9 ± 0.2  | 1.4 ± 0.02
EIQ          | 5.6 ± 0.4  | 3.3 ± 0.03 | 2.9 ± 0.02 | 5.8 ± 0.3  | 5.7 ± 0.1
EI-Vol2      | 3.2 ± 0.2  | 0.9 ± 0.01 | 1.2 ± 0.1  | 5.1 ± 0.2  | 1.7 ± 0.1
GPUCB-Vol2   | 3.5 ± 0.4  | 1.6 ± 0.05 | 9.4 ± 0.7  | 2.9 ± 0.1  | 12.0 ± 1.1
GPUCB-FBO    | 5.6 ± 0.4  | 8.3 ± 1.1  | 5.4 ± 0.2  | 8.6 ± 0.3  | 18.8 ± 2.9

6.3 Hyperparameter Tuning for Machine Learning Models

Next we apply our method to hyperparameter tuning of three machine learning models on the MNIST dataset: elastic net, multilayer perceptron and convolutional neural network. For each model, the experiments are repeated 20 times and each time the initial search space is placed differently.
Elastic Net Elastic net is a regularized regression method that utilizes the L1 and L2 regularizers. In the model, the hyperparameter α > 0 determines the magnitude of the penalty and the hyperparameter l (0 ≤ l ≤ 1) balances between the L1 and L2 regularizers. We tune α in the normal space while l is tuned in an exponent space (base 10). 
The initial search space of α and l is randomly placed in the domain [−3, −1] × [0, 1] with side length equal to 20% of the domain side length. We implement the Elastic net model using the function SGDClassifier in the scikit-learn package [13].
Multilayer Perceptron (MLP) We construct a 2-layer MLP with 512 neurons/layer. We optimize three hyperparameters: the learning rate l and the L2 norm regularization hyperparameters lr1 and lr2 of the two layers. All the hyperparameters are tuned in the exponent space (base 10). The initial search space is a randomly placed unit cube in the cube [−6, −1]³. The model is implemented using tensorflow. The model is trained with the Adam optimizer for 20 epochs with a batch size of 128.
Convolutional Neural Network (CNN) We consider a CNN with two convolutional layers. The CNN architecture (e.g. the number of filters, the filter shape, etc.) is chosen as the standard architecture published on the official GitHub repository of tensorflow 1. We optimize three hyperparameters: the learning rate l and the dropout rates rd1, rd2 in the pooling layers 1 and 2. We tune rd1, rd2 in the normal space while l is tuned in an exponent space (base 10). The initial search space of rd1, rd2, l is randomly placed in the domain [0, 1] × [0, 1] × [−5, −1] with side length equal to 20% of the domain side length. The network is trained with the Adam optimizer for 20 epochs with a batch size of 128.

Figure 4: Prediction accuracy of different machine learning models on the MNIST dataset using different algorithms. Mean and standard error over 20 repetitions are shown. (Best seen in color)

Given a set of hyperparameters, we train the models with this hyperparameter setting using the MNIST train dataset (55000 patterns) and then test the model on the MNIST test dataset (10000 patterns). 
The Bayesian optimization method then suggests a new hyperparameter setting based on the prediction accuracy on the test dataset. This process is conducted iteratively until the evaluation budget (10d evaluations) is depleted. We plot the prediction accuracy in Figure 4. For the elastic net model, our method GPUCB-UBO performs similarly to GPUCB-FBO while outperforming the other six approaches significantly. For the MLP model, GPUCB-UBO performs far better than the other approaches: after only 12 iterations, it achieves a prediction accuracy of 97.8%, whilst the other approaches take more than 24 iterations to reach this level. For the CNN model, GPUCB-UBO also outperforms the other approaches by a large margin. After 30 iterations, it can provide a CNN model with a prediction accuracy of 98.7%.

1 https://github.com/tensorflow/tensorflow

7 Conclusion

We propose a novel Bayesian optimization framework for problems where the search space is unknown. We guarantee that in iterative expansions of the search space, our method can find a point whose function value is within ε of the objective function maximum. Without the need to specify any parameters, our algorithm automatically triggers the minimal required expansion at each iteration. We demonstrate our method on both synthetic benchmark functions and machine learning hyper-parameter tuning tasks and show that it outperforms state-of-the-art approaches.
Our source code is publicly available at https://github.com/HuongHa12/BO_unknown_searchspace.

Acknowledgments

This research was partially funded by the Australian Government through the Australian Research Council (ARC). Prof Venkatesh is the recipient of an ARC Australian Laureate Fellowship (FL170100006).

References
[1] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2003.
[2] P. Auer, N. Cesa-Bianchi, and P. Fischer.
Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[3] I. Bogunovic, J. Scarlett, A. Krause, and V. Cevher. Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS), pages 1515–1523, USA, 2016.
[4] A.D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12:2879–2904, 2011.
[5] V. Dani, T.P. Hayes, and S.M. Kakade. Stochastic linear optimization under bandit feedback. In COLT, 2008.
[6] P. Hennig and C.J. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13(1):1809–1837, 2012.
[7] J.M. Hernández-Lobato, M.W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems (NIPS), pages 918–926, 2014.
[8] D.R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21(4):345–383, 2001.
[9] D.R. Jones, M. Schonlau, and W.J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, December 1998.
[10] H.J. Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1):97–106, 1964.
[11] J. Močkus, V. Tiesis, and A. Žilinskas. The application of Bayesian methods for seeking the extremum, volume 2 of Toward Global Optimization. Elsevier, 1978.
[12] V. Nguyen, S. Gupta, S. Rana, C. Li, and S. Venkatesh. Bayesian optimization in weakly specified search space. In 2017 IEEE International Conference on Data Mining (ICDM), pages 347–356, 2017.
[13] F. Pedregosa and G. Varoquaux et al. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12:2825–2830, 2011.
[14] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[15] J. Scarlett. Tight regret bounds for Bayesian optimization in one dimension. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 4500–4508, Stockholmsmässan, Stockholm, Sweden, 2018.
[16] B. Shahriari, A. Bouchard-Côté, and N. de Freitas. Unbounded Bayesian optimization via regularization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 51, pages 1168–1176, 2016.
[17] B. Shahriari, K. Swersky, Z. Wang, R.P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
[18] J. Snoek, H. Larochelle, and R.P. Adams. Practical Bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2 (NIPS), NIPS'12, pages 2951–2959, USA, 2012.
[19] N. Srinivas, A. Krause, S.M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 1015–1022, 2010.
[20] Z. Wang and S. Jegelka. Max-value entropy search for efficient Bayesian optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3627–3635, 2017.