{"title": "Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness", "book": "Advances in Neural Information Processing Systems", "page_first": 783, "page_last": 791, "abstract": null, "full_text": "Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness\n\nRémi Munos, SequeL project, INRIA Lille - Nord Europe, France. remi.munos@inria.fr\n\nAbstract\nWe consider a global optimization problem of a deterministic function f in a semi-metric space, given a finite budget of n evaluations. The function f is assumed to be locally smooth (around one of its global maxima) with respect to a semi-metric ℓ. We describe two algorithms based on optimistic exploration that use a hierarchical partitioning of the space at all scales. A first contribution is an algorithm, DOO, that requires the knowledge of ℓ. We report a finite-sample performance bound in terms of a measure of the quantity of near-optimal states. We then define a second algorithm, SOO, which does not require the knowledge of the semi-metric ℓ under which f is smooth, and whose performance is almost as good as DOO optimally fitted.\n\n1 Introduction\nWe consider the problem of finding a good approximation of the maximum of a function f : X → R using a finite budget of evaluations of the function. More precisely, we want to design a sequential exploration strategy of the search space X, i.e. a sequence x_1, x_2, ..., x_n of states of X, where each x_t may depend on the previously observed values f(x_1), ..., f(x_{t-1}), such that at round n (the computational budget), the algorithm A returns a state x(n) with highest possible value. The performance of the algorithm is evaluated by the loss\n\nr_n = sup_{x∈X} f(x) - f(x(n)).   (1)\n\nHere the performance criterion is the accuracy of the recommendation made after n evaluations of the function (which may be thought of as calls to a black-box model). 
This criterion is different from usual bandit settings, where the cumulative regret (R_n = n sup_{x∈X} f(x) - Σ_{t=1}^{n} f(x_t)) measures how well the algorithm succeeds in selecting states with good values while exploring the search space. The loss criterion (1) is closer to the simple regret defined in the bandit setting [BMS09, ABM10]. Since the literature on global optimization is huge, we only mention the works that are closely related to our contribution. The approach followed here can be seen as an optimistic sampling strategy where, at each round, we explore the part of the space where the function could be the largest, given the knowledge of the previous evaluations. A large body of algorithmic work has been developed using branch-and-bound techniques [Neu90, Han92, Kea96, HT96, Pin96, Flo99, SS00], such as Lipschitz optimization where the function is assumed to be globally Lipschitz. Our first contribution with respect to (w.r.t.) this literature is to considerably weaken the Lipschitz assumption usually made and consider only a locally one-sided Lipschitz assumption around the maximum of f. In addition, we do not require the space to be a metric space but only to be equipped with a semi-metric. The optimistic strategy has recently been intensively studied in the bandit literature, such as in the UCB algorithm [ACBF02] and its many extensions to tree search [KS06, CM07] (with application to computer-go [GWMT06]), planning [HM08, BM10, BMSB11], and Gaussian process optimization [SKKS10]. The case of a Lipschitz (or relaxed) assumption in metric spaces is considered in [Kle04, AOS07] and more recently in [KSU08, BMSS08, BMSS11]; for the case of an unknown Lipschitz constant, see [BSY11, Sli11] (where a bound on the Hessian or another related parameter is assumed). 
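The distinction between the loss (1) and the cumulative regret can be checked numerically. A minimal sketch (the toy objective, evaluation points, and budget below are our own illustrative choices, not from the paper):

```python
def f(x):
    # Toy objective on [0, 1] with maximum f(0.5) = 1.0
    return 1.0 - abs(x - 0.5)

# A hypothetical sequence of n evaluated states x_1, ..., x_n
xs = [0.1, 0.9, 0.3, 0.7, 0.45, 0.55, 0.5]
n = len(xs)
f_star = 1.0  # sup_x f(x), known here by construction

# Loss (1): accuracy of the recommendation x(n) = best point seen so far
loss = f_star - max(f(x) for x in xs)

# Cumulative regret: n * f_star minus the sum of all observed values;
# it penalizes every exploratory evaluation, not just the recommendation
cum_regret = n * f_star - sum(f(x) for x in xs)

print(loss, cum_regret)
```

Since the recommendation is at least as good as the average evaluated point, the loss never exceeds the per-round cumulative regret, r_n ≤ R_n / n; the converse does not hold, which is why the two criteria lead to different exploration strategies.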
Compared to this literature, our contribution is the design and analysis of two algorithms: (1) a first algorithm, Deterministic Optimistic Optimization (DOO), that requires the knowledge of the semi-metric ℓ for which f is locally smooth around its maximum. A loss bound is provided (in terms of the near-optimality dimension of f under ℓ) in a more general setting than previously considered. (2) A second algorithm, Simultaneous Optimistic Optimization (SOO), that does not require the knowledge of ℓ. We show that SOO performs almost as well as DOO optimally fitted.\n\n2 Assumptions about the hierarchical partition and the function\nOur optimization algorithms will be implemented by resorting to a hierarchical partitioning of the space X, which is given to the algorithms. More precisely, we consider a set of partitions of X at all scales h ≥ 0: for any integer h, X is partitioned into a set of K^h sets X_{h,i} (called cells), where 0 ≤ i ≤ K^h - 1. This partitioning may be represented by a K-ary tree structure where each cell X_{h,i} corresponds to a node (h,i) of the tree (indexed by its depth h and index i), and such that each node (h,i) possesses K children nodes {(h+1, i_k)}_{1≤k≤K}. In addition, the cells of the children {X_{h+1,i_k}, 1 ≤ k ≤ K} form a partition of the parent's cell X_{h,i}. The root of the tree corresponds to the whole domain X (cell X_{0,0}). To each cell X_{h,i} is assigned a specific state x_{h,i} ∈ X_{h,i} where f may be evaluated. We now state 4 assumptions: Assumption 1 is about the semi-metric ℓ, Assumption 2 is about the smoothness of the function w.r.t. ℓ, and Assumptions 3 and 4 are about the shape of the hierarchical partition w.r.t. ℓ. Assumption 1 (Semi-metric). We assume that ℓ : X × X → R+ is such that for all x, y ∈ X, we have ℓ(x,y) = ℓ(y,x) and ℓ(x,y) = 0 if and only if x = y. Note that we do not require that ℓ satisfies the triangle inequality (in which case ℓ would be a metric). 
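Assumption 1 is easy to verify numerically for the prototypical example ℓ(x,y) = |x-y|^α on the real line: symmetry and the identity of indiscernibles hold for any α > 0, but the triangle inequality fails as soon as α > 1. A small sketch (the test points are arbitrary):

```python
def ell(x, y, alpha):
    # Candidate semi-metric l(x, y) = |x - y|^alpha on the real line
    return abs(x - y) ** alpha

# Symmetry and "l(x, y) = 0 iff x = y" hold for any alpha > 0
assert ell(0.2, 0.9, 2.0) == ell(0.9, 0.2, 2.0)
assert ell(0.7, 0.7, 2.0) == 0.0

# Triangle inequality holds for alpha <= 1 ...
for alpha in (0.5, 1.0):
    assert ell(0.0, 2.0, alpha) <= ell(0.0, 1.0, alpha) + ell(1.0, 2.0, alpha)

# ... but fails for alpha = 2: l(0,2) = 4 > l(0,1) + l(1,2) = 2,
# so l is only a semi-metric
assert ell(0.0, 2.0, 2.0) > ell(0.0, 1.0, 2.0) + ell(1.0, 2.0, 2.0)
```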
An example of a metric space is the Euclidean space R^d with the metric ℓ(x,y) = ‖x-y‖ (Euclidean norm). Now consider R^d with ℓ(x,y) = ‖x-y‖^α, for some α > 0. When α ≤ 1, then ℓ is also a metric, but whenever α > 1 then ℓ does not satisfy the triangle inequality anymore, and is thus a semi-metric only. Assumption 2 (Local smoothness of f). There exists at least one global optimizer x* ∈ X of f (i.e., f(x*) = sup_{x∈X} f(x)) such that for all x ∈ X,\n\nf(x*) - f(x) ≤ ℓ(x, x*).   (2)\n\nThis condition guarantees that f does not decrease too fast around (at least) one global optimum x* (this is a sort of locally one-sided Lipschitz assumption). Now we state the assumptions about the hierarchical partitions. Assumption 3 (Bounded diameters). There exists a decreasing sequence δ(h) > 0 such that for any depth h ≥ 0 and any cell X_{h,i} of depth h, we have sup_{x∈X_{h,i}} ℓ(x_{h,i}, x) ≤ δ(h).\n\nAssumption 4 (Well-shaped cells). There exists ν > 0 such that for any depth h ≥ 0, any cell X_{h,i} contains an ℓ-ball of radius νδ(h) centered in x_{h,i}.\n\n3 When the semi-metric ℓ is known\nIn this section, we consider the setting where Assumptions 1-4 hold for a specific semi-metric ℓ, and this semi-metric is known to the algorithm. 3.1 The DOO Algorithm The Deterministic Optimistic Optimization (DOO) algorithm described in Figure 1 uses explicitly the knowledge of ℓ (through the use of δ(h)). DOO builds incrementally a tree T_t for t = 1...n, by\n\nInitialization: T_1 = {(0,0)} (root node)\nfor t = 1 to n do\n  Select the leaf (h,j) ∈ L_t with maximum b-value b_{h,j} := f(x_{h,j}) + δ(h)\n  Expand this node: add to T_t the K children of (h,j)\nend for\nReturn x(n) = arg max_{(h,i)∈T_n} f(x_{h,i})\n\nFigure 1: Deterministic optimistic optimization (DOO) algorithm.\n\nselecting at each round t a leaf of the current tree T_t to expand. Expanding a leaf means adding its K children to the current tree (this corresponds to splitting the cell X_{h,j} into K sub-cells). We start with the root node T_1 = {(0,0)}.
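The loop of Figure 1 can be sketched in a few lines of Python on X = [0, 1] with interval cells. This is an illustrative sketch, not the paper's implementation: the interval representation, the evaluation cache, and the particular f and δ(h) below are our own choices.

```python
def doo(f, delta, n, K=2):
    """Sketch of DOO (Figure 1) on X = [0, 1]: cells are intervals,
    x_{h,i} is the cell center, b-value = f(x_{h,i}) + delta(h)."""
    leaves = [(0, 0.0, 1.0)]  # leaf = (depth h, left a, right b)
    evaluated = {}            # center -> f(center), cached

    def value(leaf):
        h, a, b = leaf
        x = (a + b) / 2.0
        if x not in evaluated:
            evaluated[x] = f(x)
        return evaluated[x]

    for _ in range(n):
        # Select the leaf with maximum b-value f(x_{h,j}) + delta(h)
        best = max(leaves, key=lambda leaf: value(leaf) + delta(leaf[0]))
        leaves.remove(best)
        h, a, b = best
        # Expand: split the cell into K children of depth h + 1
        w = (b - a) / K
        leaves.extend((h + 1, a + k * w, a + (k + 1) * w) for k in range(K))

    # Return the evaluated state with highest value
    return max(evaluated, key=evaluated.get)

# f(x) = 1 - |x - 1/3| is 1-Lipschitz w.r.t. l(x, y) = |x - y|, so the
# half-width delta(h) = 2^{-(h+1)} bounds the cell radius at depth h
x_best = doo(lambda x: 1.0 - abs(x - 1.0 / 3.0), lambda h: 2.0 ** (-h - 1), n=50)
```

With a valid δ(h), every b-value of a cell containing x* upper-bounds f*, so the selection rule never discards the optimal branch; here 50 expansions locate the maximizer 1/3 to high accuracy.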
We write L_t for the leaves of T_t (the set of nodes whose children are not in T_t), which are the nodes that can be expanded at round t. This algorithm is called optimistic because it expands at each round a cell that may contain the optimum of f, based on (i) the previously observed evaluations of f, and (ii) the knowledge of the local smoothness property (2) of f (since ℓ is known). The algorithm computes the b-values b_{h,j} := f(x_{h,j}) + δ(h) of all nodes (h,j) of the current tree T_t and selects the leaf with highest b-value to expand next. It returns the state x(n) with highest evaluation. 3.2 Analysis of DOO Write f* := sup_{x∈X} f(x). Note that Assumption 2 implies that the b-value of any cell containing x* upper-bounds f*, i.e., for any cell X_{h,i} such that x* ∈ X_{h,i},\n\nb_{h,i} = f(x_{h,i}) + δ(h) ≥ f(x_{h,i}) + ℓ(x_{h,i}, x*) ≥ f*.\n\nAs a consequence, a node (h,i) such that f(x_{h,i}) + δ(h) < f* will never be expanded (since at any time t, the b-value of such a node is dominated by the b-value of the leaf containing x*). We deduce that DOO only expands nodes of the set I := ∪_{h≥0} I_h, where\n\nI_h := {nodes (h,i) such that f(x_{h,i}) + δ(h) ≥ f*}.\n\nIn order to derive a loss bound we now define a measure of the quantity of near-optimal states, called the near-optimality dimension. This measure is closely related to similar measures introduced in [KSU08, BMSS08]. For any ε > 0, write X_ε := {x ∈ X, f(x) ≥ f* - ε} for the set of ε-optimal states. Definition 1 (Near-optimality dimension). The near-optimality dimension is the smallest d ≥ 0 such that there exists C > 0 such that for any ε > 0, the maximal number of disjoint ℓ-balls of radius νε with center in X_ε is less than Cε^{-d}. Note that d is not an intrinsic property of f: it characterizes both f and ℓ (since we use ℓ-balls in the packing of near-optimal states), and it also depends on ν. We now bound the number of nodes in I_h. Lemma 1. We have |I_h| ≤ Cδ(h)^{-d}. Proof. 
From Assumption 4, each cell (h,i) contains an ℓ-ball of radius νδ(h) centered in x_{h,i}; thus if |I_h| = |{x_{h,i} ∈ X_{δ(h)}}| exceeded Cδ(h)^{-d}, there would exist more than Cδ(h)^{-d} disjoint ℓ-balls of radius νδ(h) with center in X_{δ(h)}, which contradicts the definition of d. We now provide our loss bound for DOO. Theorem 1. Write h(n) for the smallest integer h such that C Σ_{l=0}^{h} δ(l)^{-d} ≥ n. Then the loss of DOO is bounded as r_n ≤ δ(h(n)).\n\nProof. Let (h_max, j) be the deepest node that has been expanded by the algorithm up to round n. We know that DOO only expands nodes in the set I. Now, among all node-expansion strategies over the set of expandable nodes I, the uniform strategy is the one which minimizes the depth of the resulting tree. From the definition of h(n) and from Lemma 1, the maximum depth of the uniform strategy is at least h(n), and we deduce that h_max ≥ h(n). Now since node (h_max, j) has been expanded, we have (h_max, j) ∈ I, thus\n\nf(x(n)) ≥ f(x_{h_max,j}) ≥ f* - δ(h_max) ≥ f* - δ(h(n)).\n\nRemark 1. This bound is in terms of the number of expanded nodes n. The actual number of function evaluations is Kn (since each expansion generates K children that need to be evaluated). Now, let us make the bound more explicit when the diameter δ(h) of the cells decreases exponentially fast with their depth (this case is rather general, as illustrated in the examples described next, as well as in the discussion in [BMSS11]). Corollary 1. Assume that δ(h) = cγ^h for some constants c > 0 and γ < 1. If the near-optimality dimension of f is d > 0, then the loss decreases polynomially fast: r_n ≤ c^{(d+1)/d} (C/(1-γ^d))^{1/d} n^{-1/d}. Now, if d = 0, then the loss decreases exponentially fast: r_n ≤ c γ^{(n/C)-1}. Proof. From Theorem 1, whenever d > 0 we have\n\nC Σ_{l=0}^{h(n)-1} δ(l)^{-d} < n ≤ C Σ_{l=0}^{h(n)} δ(l)^{-d} = c^{-d} C (γ^{-d(h(n)+1)} - 1)/(γ^{-d} - 1) ≤ c^{-d} C γ^{-dh(n)}/(1 - γ^d),\n\nthus γ^{-dh(n)} ≥ n c^d (1-γ^d)/C, from which we deduce the stated bound on r_n ≤ δ(h(n)) = cγ^{h(n)}. Now, if d = 0, then n ≤ C Σ_{l=0}^{h(n)} δ(l)^{-d} = C(h(n)+1), and we deduce that the loss is bounded as r_n ≤ δ(h(n)) = cγ^{h(n)} ≤ cγ^{(n/C)-1}. 
3.3 Examples\n\nExample 1: Let X = [-1,1]^D and f be the function f(x) = 1 - ‖x‖_∞^α, for some α ≥ 1. Consider a K = 2^D-ary tree of partitions with (hyper)-squares. Expanding a node means splitting the corresponding square into 2^D squares of half length. Let x_{h,i} be the center of X_{h,i}. Consider the following choice of the semi-metric: ℓ(x,y) = ‖x-y‖_∞^β, with β ≤ α. We have δ(h) = 2^{-hβ} (recall that δ(h) is defined in terms of ℓ), and ν = 1. The optimum of f is x* = 0, and f satisfies the local smoothness property (2). Now let us compute its near-optimality dimension. For any ε > 0, X_ε is the L_∞-ball of radius ε^{1/α} centered in 0, which can be packed by (ε^{1/α}/ε^{1/β})^D L_∞-balls of diameter ε^{1/β} (since an L_∞-ball of diameter ε^{1/β} is an ℓ-ball of diameter ε). Thus the near-optimality dimension is d = D(1/β - 1/α) (and the constant C = 1). From Corollary 1 we deduce that (i) when α > β, then d > 0 and in this case r_n = O(n^{-1/d}), and (ii) when α = β, then d = 0 and the loss decreases exponentially fast: r_n ≤ 2^{α(1-n)}. It is interesting to compare this result to a uniform sampling strategy (i.e., the function is evaluated at the points of a uniform grid), which would provide a loss of order n^{-α/D}. We observe that DOO is better than uniform whenever α < 2β and worse when α > 2β. This result provides some indication on how to choose the semi-metric ℓ (thus β), which is a key ingredient of the DOO algorithm (since δ(h) = 2^{-hβ} appears in the b-values): β should be as close as possible to the true (but unknown) α (which can be seen as a local smoothness order of f around its maximum), but never larger than α (otherwise f does not satisfy the local smoothness property (2)). Example 2: The previous analysis generalizes to any function which is locally equivalent to ‖x - x*‖^α, for some α > 0 (where ‖·‖ is any norm, e.g., Euclidean, L_∞, or L_1), around a global maximum x* (among a set of global optima assumed to be finite). 
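The comparison between DOO and the uniform grid in Example 1 is pure arithmetic on the rate exponents, and can be sketched as follows (the numerical values of D, α, β are illustrative only):

```python
def doo_exponent(D, alpha, beta):
    # DOO loss rate n^{-1/d} with near-optimality dimension d = D(1/beta - 1/alpha)
    assert beta < alpha  # d > 0 case of Corollary 1
    d = D * (1.0 / beta - 1.0 / alpha)
    return 1.0 / d  # equals alpha*beta / (D*(alpha - beta))

def uniform_exponent(D, alpha):
    # Uniform grid gives a loss of order n^{-alpha/D}
    return alpha / D

# alpha < 2*beta: DOO's exponent is larger, i.e. its loss decays faster
assert doo_exponent(D=2, alpha=1.5, beta=1.0) > uniform_exponent(D=2, alpha=1.5)

# alpha > 2*beta: the uniform grid wins
assert doo_exponent(D=2, alpha=3.0, beta=1.0) < uniform_exponent(D=2, alpha=3.0)
```

The crossover at α = 2β follows from 1/d = αβ/(D(α-β)) = α/D exactly when α - β = β.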
That is, we assume that there exist constants c_1 > 0, c_2 > 0, and η > 0 such that\n\nf(x*) - f(x) ≤ c_1 ‖x - x*‖^α for all x ∈ X,\nf(x*) - f(x) ≥ c_2 ‖x - x*‖^α for all x such that ‖x - x*‖ ≤ η.\n\nLet X = [0,1]^D. Again, consider a K = 2^D-ary tree of partitions with (hyper)-squares. Let ℓ(x,y) = c‖x-y‖^β with c ≥ c_1 and β ≤ α (so that f satisfies (2)). For simplicity we do not make all the constants explicit, using the O notation for convenience (the actual constants depend on the choice of the norm ‖·‖). We have δ(h) = O(2^{-hβ}). Now, let us compute the near-optimality dimension. For any ε > 0, X_ε is included in a ball of radius (ε/c_2)^{1/α} centered in x*, which can be packed by O((ε^{1/α}/ε^{1/β})^D) ℓ-balls of diameter ε. Thus the near-optimality dimension is d = D(1/β - 1/α), and the results of the previous example apply (up to constants): for α > β, then d > 0 and r_n = O(n^{-1/d}), and when α = β, then d = 0 and one obtains the exponential rate r_n = O(2^{-α(n/C-1)}). We deduce that the behavior of the algorithm depends on our knowledge of the local smoothness (i.e. α and c_1) of the function around its maximum. Indeed, if this smoothness information is available, then one should define the semi-metric ℓ (which impacts the algorithm through the definition of δ(h)) to match this smoothness (i.e. set β = α) and derive an exponential loss rate. Now if this information is unknown, then one should underestimate the true smoothness (i.e. choose β < α) and suffer a loss r_n = O(n^{-1/d}), rather than overestimate it (β > α), since in this case (2) may not hold anymore and there is a risk that the algorithm converges to a local optimum (thus suffering a constant loss). 3.4 Comparison with previous works Optimistic planning: The deterministic planning problem described in [HM08] considers an optimistic approach for selecting the first action of a sequence that maximizes the sum of discounted rewards. We can easily cast their problem in our setting by considering for X the set of infinite sequences of actions. 
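In this planning setting, the semi-metric is ℓ(x,y) = γ^{h(x,y)}/(1-γ), where h(x,y) is the length of the common action prefix of the sequences x and y and γ is the discount factor: with rewards in [0,1], two sequences agree on their first h(x,y) rewards, so |f(x) - f(y)| ≤ Σ_{t≥h(x,y)} γ^t = ℓ(x,y). A numerical sketch of this Lipschitz property (the truncation horizon and the reward model are our own illustrative stand-ins for an environment):

```python
import random

GAMMA = 0.9
T = 60  # truncation horizon standing in for "infinite" sequences

def reward(prefix):
    # Deterministic reward in [0, 1) attached to an action prefix
    # (any such mapping works; this one is a seeded-hash stand-in)
    return random.Random(hash(prefix)).random()

def f(seq):
    # Discounted sum of rewards along the action sequence
    return sum(GAMMA ** t * reward(seq[: t + 1]) for t in range(len(seq)))

def ell(x, y):
    # Semi-metric: gamma^h / (1 - gamma), h = length of the common prefix
    h = 0
    while h < min(len(x), len(y)) and x[h] == y[h]:
        h += 1
    return GAMMA ** h / (1.0 - GAMMA)

rng = random.Random(0)
for _ in range(100):
    x = tuple(rng.randrange(2) for _ in range(T))
    # y shares a random-length prefix with x, then diverges arbitrarily
    cut = rng.randrange(T)
    y = x[:cut] + tuple(rng.randrange(2) for _ in range(T - cut))
    assert abs(f(x) - f(y)) <= ell(x, y) + 1e-9  # f is Lipschitz w.r.t. ell
```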
The metric is ℓ(x,y) = γ^{h(x,y)}/(1-γ), where h(x,y) is the length of the common initial sequence of actions of x and y, and γ is the discount factor. It is easy to show that the function f(x), defined as the discounted sum of rewards along the sequence x of actions, is Lipschitz w.r.t. ℓ and thus satisfies (2). Their algorithm is very close to DOO: it expands a node of the tree (a finite sequence of actions) with highest upper-bound on the possible value. Their regret analysis makes use of a quantity of near-optimal sequences, from which they define κ ∈ [1,K], which can be seen as the branching factor of the set of nodes I that can be expanded. This measure is related to our near-optimality dimension by κ = γ^{-d}. Corollary 1 then directly implies the loss bound r_n = O(n^{-(log 1/γ)/(log κ)}), which is the result reported in [HM08]. HOO and Zooming algorithms: The DOO algorithm can be seen as a deterministic version of the HOO algorithm of [BMSS11] and is also closely related to the Zooming algorithm of [KSU08]. Those works consider the case of noisy evaluations of the function (the X-armed bandit setting), where the function is assumed to be weakly Lipschitz (a slightly stronger assumption than our Assumption 2). The bounds reported in those works are (for the case of exponentially decreasing diameters considered in their work and in our Corollary 1) on the cumulative regret, R_n = O(n^{(d+1)/(d+2)}), which translates into the loss considered here as r_n = O(n^{-1/(d+2)}), where d is the near-optimality dimension (or the closely defined zooming dimension). We conclude that a deterministic evaluation of the function enables us to obtain a much better polynomial rate O(n^{-1/d}) when d > 0, and even an exponential rate when d = 0 (Corollary 1). In the next section, we address the problem of an unknown semi-metric ℓ, which is the main contribution of the paper.\n\n4 When the semi-metric ℓ is unknown\nWe now consider the setting where Assumptions 1-4 hold for some semi-metric ℓ, but this semi-metric is unknown. 
The hierarchical partitioning of the space is still given, but since ℓ is unknown, one cannot use the diameters δ(h) of the cells to design upper-bounds, as in DOO. The question we wish to address is: if ℓ is unknown, is it possible to implement an optimistic algorithm with performance guarantees? We provide a positive answer to this question, and in addition we show that we can be almost as good as an algorithm that would know ℓ, for the best possible ℓ satisfying Assumptions 1-4.\n\nThe maximum depth function t → h_max(t) is a parameter of the algorithm.\nInitialization: T_1 = {(0,0)} (root node). Set t = 1.\nwhile True do\n  Set v_max = -∞.\n  for h = 0 to min(depth(T_t), h_max(t)) do\n    Among all leaves (h,j) ∈ L_t of depth h, select (h,i) ∈ arg max_{(h,j)∈L_t} f(x_{h,j})\n    if f(x_{h,i}) ≥ v_max then\n      Expand this node: add to T_t the K children (h+1, i_k)_{1≤k≤K}\n      Set v_max = f(x_{h,i}), set t = t+1\n      if t = n then Return x(n) = arg max_{(h,i)∈T_n} f(x_{h,i}) end if\n    end if\n  end for\nend while\n\nFigure 2: Simultaneous Optimistic Optimization (SOO) algorithm. 4.1 The SOO algorithm The idea is to expand at each round simultaneously all the leaves (h,j) for which there exists a semi-metric ℓ such that the corresponding upper-bound f(x_{h,j}) + sup_{x∈X_{h,j}} ℓ(x_{h,j}, x) would be the highest. This is implemented by expanding at each round at most one leaf per depth, and a leaf is expanded only if its value is the largest among all leaves of the same or lower depths. The Simultaneous Optimistic Optimization (SOO) algorithm is described in Figure 2. The SOO algorithm takes as a parameter a function t → h_max(t) which limits the tree to a maximal depth of h_max(t) after t node expansions. Again, L_t refers to the set of leaves of T_t. 4.2 Analysis of SOO All the previously relevant quantities, such as the diameters δ(h), the sets I_h, and the near-optimality dimension d, depend on the unknown semi-metric ℓ (which is such that Assumptions 1-4 are satisfied). 
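The sweep of Figure 2 can be sketched in Python on X = [0, 1], mirroring the DOO sketch but without any δ(h). This is an illustrative sketch under our own choices (interval cells, K = 3 so that the middle child reuses the parent's center evaluation, and a toy f whose smoothness order is unknown to the algorithm):

```python
import math

def soo(f, n, h_max, K=3):
    """Sketch of SOO (Figure 2) on X = [0, 1]: per sweep, expand at most
    one leaf per depth, and only if its value is at least the best value
    seen at shallower depths. No semi-metric is used."""
    leaves = [(0, 0.0, 1.0)]  # leaf = (depth h, left a, right b)
    evaluated = {}            # center -> f(center), cached
    t = 1                     # number of node expansions so far

    def value(leaf):
        h, a, b = leaf
        x = (a + b) / 2.0
        if x not in evaluated:
            evaluated[x] = f(x)
        return evaluated[x]

    while t < n:
        v_max = -math.inf
        depth_limit = min(max(h for h, _, _ in leaves), int(h_max(t)))
        for h in range(depth_limit + 1):
            at_h = [leaf for leaf in leaves if leaf[0] == h]
            if not at_h:
                continue
            best = max(at_h, key=value)
            if value(best) >= v_max:      # best at this depth beats shallower depths
                v_max = value(best)
                leaves.remove(best)
                _, a, b = best
                w = (b - a) / K           # expand into K children
                leaves.extend((h + 1, a + k * w, a + (k + 1) * w) for k in range(K))
                t += 1
                if t >= n:
                    break

    return max(evaluated, key=evaluated.get)

# f(x) = 1 - |x - 1/3|^alpha with alpha = 0.5 unknown to the algorithm
x_best = soo(lambda x: 1.0 - abs(x - 1.0 / 3.0) ** 0.5, n=200, h_max=math.isqrt)
```

The choice K = 3 is a common implementation trick: the middle child has the same center as its parent, so its evaluation comes for free from the cache.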
At time t, let us write h*_t for the depth of the deepest expanded node in the branch containing x* (an optimal branch). Let (h*_t+1, i*) be an optimal node of depth h*_t+1 (i.e., such that x* ∈ X_{h*_t+1, i*}). Since this node has not been expanded yet, any node (h*_t+1, i) of depth h*_t+1 that is expanded before (h*_t+1, i*) is δ(h*_t+1)-optimal. Indeed, f(x_{h*_t+1,i}) ≥ f(x_{h*_t+1,i*}) ≥ f* - δ(h*_t+1). We deduce that once an optimal node of depth h*_t is expanded, it takes at most |I_{h*_t+1}| node expansions at depth h*_t+1 before the optimal node of depth h*_t+1 is expanded. From that simple observation, we deduce the following lemma. Lemma 2. For any depth 0 ≤ h ≤ h_max(t), whenever t ≥ (|I_0| + |I_1| + ... + |I_h|) h_max(t), we have h*_t ≥ h. Proof. We prove it by induction. For h = 0, we have h*_t ≥ 0 trivially. Assume that the proposition is true for all 0 ≤ h ≤ h_0 with h_0 < h_max(t). Let us prove that it is also true for h_0+1. Let t ≥ (|I_0| + |I_1| + ... + |I_{h_0+1}|) h_max(t). Since t ≥ (|I_0| + |I_1| + ... + |I_{h_0}|) h_max(t), we know that h*_t ≥ h_0. So either h*_t ≥ h_0+1, in which case the proof is finished, or h*_t = h_0. In this latter case, consider the nodes of depth h_0+1 that are expanded. We have seen that as long as the optimal node of depth h_0+1 is not expanded, any node of depth h_0+1 that is expanded must be δ(h_0+1)-optimal, i.e., belongs to I_{h_0+1}. Since there are at most |I_{h_0+1}| of them, after at most |I_{h_0+1}| h_max(t) node expansions the optimal one must be expanded, thus h*_t ≥ h_0+1. Theorem 2. Write h(n) for the smallest integer h such that\n\nC h_max(n) Σ_{l=0}^{h} δ(l)^{-d} ≥ n.   (3)\n\nThen the loss is bounded as\n\nr_n ≤ δ(min(h(n), h_max(n)+1)).   (4)\n\nProof. From Lemma 1 and the definition of h(n), we have\n\nh_max(n) Σ_{l=0}^{h(n)-1} |I_l| ≤ C h_max(n) Σ_{l=0}^{h(n)-1} δ(l)^{-d} < n.\n\nLet (h,j) be the deepest node in T_n that has been expanded by the algorithm up to round n; thus h ≥ h*_n. 
Now, from the definition of the algorithm, we only expand a node when its value is larger than the values of all the leaves of equal or lower depths. Thus, since the node (h,j) has been expanded, its value is at least as high as that of the optimal node (h*_n+1, i*) of depth h*_n+1 (which has not been expanded, by definition of h*_n). Thus\n\nf(x(n)) ≥ f(x_{h,j}) ≥ f(x_{h*_n+1,i*}) ≥ f* - δ(h*_n+1).\n\nMoreover, from Lemma 2 and the previous display, when h(n)-1 ≤ h_max(n) we have h*_n ≥ h(n)-1; and in the case h(n)-1 > h_max(n), since the SOO algorithm does not expand nodes beyond depth h_max(n), we have h*_n = h_max(n). Thus in all cases h*_n ≥ min(h(n)-1, h_max(n)), so that f(x(n)) ≥ f* - δ(min(h(n), h_max(n)+1)), which proves (4).\n\nRemark 2. This result appears very surprising: although the semi-metric ℓ is not known, the performance is almost as good as that of DOO (see Theorem 1), which uses the knowledge of ℓ. The main difference is that the maximal depth h_max(n) appears both as a multiplicative factor in the definition of h(n) in (3) and as a threshold in the loss bound (4). These two appearances of h_max(n) define a tradeoff between deep (large h_max) versus broad (small h_max) types of exploration. We now illustrate the case of exponentially decreasing diameters. Corollary 2. Assume that δ(h) = cγ^h for some c > 0 and γ < 1. Consider the two cases:\n\n- The near-optimality dimension is d > 0. Let the depth function be h_max(t) = t^ε, for some ε > 0 arbitrarily small. Then, for n large enough (as a function of ε), the loss of SOO is bounded as r_n ≤ c^{(d+1)/d} (C/(1-γ^d))^{1/d} n^{-(1-ε)/d}.\n\n- The near-optimality dimension is d = 0. Let the depth function be h_max(t) = √t. Then the loss of SOO is bounded as r_n ≤ c γ^{√n min(1/C,1) - 1}.\n\nProof. From Theorem 2, when d > 0 we have\n\nn ≤ C h_max(n) Σ_{l=0}^{h(n)} δ(l)^{-d} = c^{-d} C h_max(n) (γ^{-d(h(n)+1)} - 1)/(γ^{-d} - 1),\n\nthus for the choice h_max(n) = n^ε we deduce γ^{-dh(n)} ≥ n^{1-ε} c^d (1-γ^d)/C. Thus h(n) is logarithmic in n, and for n large enough (as a function of ε), h(n) ≤ h_max(n)+1, thus\n\nr_n ≤ δ(min(h(n), h_max(n)+1)) = δ(h(n)) = c γ^{h(n)} ≤ c^{(d+1)/d} (C/(1-γ^d))^{1/d} n^{-(1-ε)/d}.\n\nNow, if d = 0, then n ≤ C h_max(n) Σ_{l=0}^{h(n)} δ(l)^{-d} = C h_max(n) (h(n)+1); thus for the choice h_max(n) = √n we deduce that the loss decreases as r_n ≤ c γ^{√n min(1/C,1) - 1}.\n\nRemark 3. The maximal depth function h_max(t) is still a parameter of the algorithm, which somehow influences the behavior of the algorithm (deep versus broad exploration of the tree). However, for a large class of problems (e.g. when d > 0) the choice of the order ε does not impact the asymptotic performance of the algorithm. Remark 4. Since our algorithm does not depend on ℓ, our analysis is actually true for any semi-metric ℓ that satisfies Assumptions 1-4; thus Theorem 2 and Corollary 2 hold for the best possible choice of such an ℓ. In particular, we can think of problems for which there exists a semi-metric ℓ such that the corresponding near-optimality dimension d is 0. Instead of describing a general class of problems satisfying this property, we illustrate in the next subsection non-trivial optimization problems in X = R^D where there exists ℓ such that d = 0.\n\n4.3 Examples Example 1: Consider the previous Example 1, where X = [-1,1]^D and f is the function f(x) = 1 - ‖x‖_∞^α, with α ≥ 1 unknown. We have seen that DOO with the metric ℓ(x,y) = ‖x-y‖_∞^β provides a polynomial loss r_n = O(n^{-1/d}) (with d = D(1/β - 1/α)) whenever β < α, and an exponential loss r_n ≤ 2^{α(1-n)} when β = α. However, here α is unknown. Now consider the SOO algorithm with the maximum depth function h_max(t) = √t. As mentioned before, SOO does not require the knowledge of ℓ, so we can apply the analysis to any ℓ that satisfies Assumptions 1-4. So let us consider ℓ(x,y) = ‖x-y‖_∞^α. Then δ(h) = 2^{-hα}, ν = 1, and the near-optimality dimension of f under ℓ is d = 0 (with C = 1). We deduce that the loss of SOO is r_n ≤ 2^{α(1-√n)}. 
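The gap between this stretched-exponential rate and the polynomial rate of a uniform grid is easy to see numerically. A small sketch of the arithmetic (the constants and the sample budgets below are illustrative):

```python
import math

def soo_bound(n, alpha):
    # Stretched-exponential bound of the form 2^{alpha(1 - sqrt(n))}
    return 2.0 ** (alpha * (1.0 - math.sqrt(n)))

def uniform_bound(n, alpha, D):
    # Uniform grid loss of order n^{-alpha/D}
    return n ** (-alpha / D)

# Already at modest budgets the stretched exponential is far smaller
for n in (10**2, 10**4, 10**6):
    assert soo_bound(n, alpha=1.0) < uniform_bound(n, alpha=1.0, D=2)
```

For instance, at n = 100 the stretched-exponential bound is 2^{-9} ≈ 0.002 while the uniform-grid bound is 100^{-1/2} = 0.1, and the ratio only widens with n.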
Thus SOO provides a stretched-exponential loss without requiring the knowledge of α. Note that a uniform grid provides the loss n^{-α/D}, which is only polynomially decreasing (and subject to the curse of dimensionality). Thus, in this example SOO is always better than both uniform sampling and DOO, except if one knows α perfectly and uses DOO with β = α (in which case we obtain an exponential loss). The fact that SOO is not as good as DOO optimally fitted comes from the truncation of SOO at the maximal depth h_max(n) = √n (whereas DOO optimally fitted would explore the tree up to a depth linear in n). Example 2: The same conclusion holds for Example 2, where we consider a function f defined on [0,1]^D that is locally equivalent to ‖x-x*‖^α for some unknown α > 0 (see the precise assumptions in Section 3.3). We have seen that DOO using ℓ(x,y) = c‖x-y‖^β with β < α has a loss r_n = O(n^{-1/d}), and when β = α, then d = 0 and the loss is r_n = O(2^{-α(n/C-1)}). Now by using SOO (which does not require the knowledge of α) with h_max(t) = √t, we deduce the stretched-exponential loss r_n = O(2^{-α√n/C}) (by using ℓ(x,y) = ‖x-y‖^α in the analysis, which gives δ(h) = O(2^{-hα}) and d = 0). 4.4 Comparison with the DIRECT algorithm The DIRECT (DIviding RECTangles) algorithm [JPS93, FK04, Gab01] is a Lipschitz optimization algorithm where the Lipschitz constant L of f is unknown. It uses an optimistic splitting technique similar to ours where, at each round, it expands the set of nodes that have the highest upper-bound (as defined in DOO) for at least some value of L. To the best of our knowledge, there is no finite-time analysis of this algorithm (only the consistency property lim_{n→∞} r_n = 0 is proven in [FK04]). Our approach generalizes DIRECT, and we are able to derive finite-time loss bounds in a much broader setting where the function is only locally smooth and the space is semi-metric. 
We are not aware of any other finite-time analysis of global optimization algorithms that do not require the knowledge of the smoothness of the function.\n\n5 Conclusions\nWe presented two algorithms: DOO requires the knowledge of the semi-metric ℓ under which the function f is locally smooth (according to Assumption 2); SOO does not require this knowledge and performs almost as well as DOO optimally fitted (i.e. for the best choice of ℓ satisfying Assumptions 1-4). We reported finite-time loss bounds using the near-optimality dimension d, which relates the local smoothness of f around its maximum to the quantity of near-optimal states, measured with the semi-metric ℓ. We provided illustrative examples of the performance of SOO in Euclidean spaces where the local smoothness of f is unknown. Possible future research directions include (i) deriving problem-dependent lower bounds, (ii) characterizing classes of functions f such that there exists a semi-metric ℓ for which f is locally smooth w.r.t. ℓ and whose corresponding near-optimality dimension is d = 0 (in order to have a stretched-exponentially decreasing loss), and (iii) extending the SOO algorithm to stochastic X-armed bandits (optimization of a noisy function) when the smoothness of f is unknown. Acknowledgements: French ANR EXPLO-RA (ANR-08-COSI-004) and the European project COMPLACS (FP7, grant agreement no 231495).\n\nReferences\n[ABM10] J.-Y. Audibert, S. Bubeck, and R. Munos. Best arm identification in multi-armed bandits. In Conference on Learning Theory, 2010.\n[ACBF02] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning Journal, 47(2-3):235-256, 2002.\n[AOS07] P. Auer, R. Ortner, and Cs. Szepesvári. Improved rates for the stochastic continuum-armed bandit problem. In 20th Conference on Learning Theory, pages 454-468, 2007.\n[BM10] S. Bubeck and R. Munos. Open loop optimistic planning. In Conference on Learning Theory, 2010.\n[BMS09] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandits problems. In Proc. of the 20th International Conference on Algorithmic Learning Theory, pages 23-37, 2009.\n[BMSB11] L. Busoniu, R. Munos, B. De Schutter, and R. Babuska. Optimistic planning for sparsely stochastic systems. In IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2011.\n[BMSS08] S. Bubeck, R. Munos, G. Stoltz, and Cs. Szepesvári. Online optimization of X-armed bandits. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems, volume 22, pages 201-208. MIT Press, 2008.\n[BMSS11] S. Bubeck, R. Munos, G. Stoltz, and Cs. Szepesvári. X-armed bandits. Journal of Machine Learning Research, 12:1655-1695, 2011.\n[BSY11] S. Bubeck, G. Stoltz, and J. Y. Yu. Lipschitz bandits without the Lipschitz constant. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory, 2011.\n[CM07] P.-A. Coquelin and R. Munos. Bandit algorithms for tree search. In Uncertainty in Artificial Intelligence, 2007.\n[FK04] D. E. Finkel and C. T. Kelley. Convergence analysis of the DIRECT algorithm. Technical report, North Carolina State University, 2004.\n[Flo99] C.A. Floudas. Deterministic Global Optimization: Theory, Algorithms and Applications. Kluwer Academic Publishers, Dordrecht / Boston / London, 1999.\n[Gab01] J. M. X. Gablonsky. Modifications of the DIRECT algorithm. PhD thesis, 2001.\n[GWMT06] S. Gelly, Y. Wang, R. Munos, and O. Teytaud. Modification of UCT with patterns in Monte-Carlo Go. Technical report, INRIA RR-6062, 2006.\n[Han92] E.R. Hansen. Global Optimization Using Interval Analysis. Marcel Dekker, New York, 1992.\n[HM08] J-F. Hren and R. Munos. Optimistic planning of deterministic systems. In Recent Advances in Reinforcement Learning (European Workshop on Reinforcement Learning), Springer LNAI 5323, pages 151-164, 2008.\n[HT96] R. Horst and H. Tuy. Global Optimization: Deterministic Approaches. Springer, Berlin / Heidelberg / New York, 3rd edition, 1996.\n[JPS93] D. R. Jones, C. D. Perttunen, and B. E. Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157-181, 1993.\n[Kea96] R. B. Kearfott. Rigorous Global Search: Continuous Problems. Kluwer Academic Publishers, Dordrecht / Boston / London, 1996.\n[Kle04] R. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In 18th Advances in Neural Information Processing Systems, 2004.\n[KS06] L. Kocsis and Cs. Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of the 15th European Conference on Machine Learning, pages 282-293, 2006.\n[KSU08] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In Proceedings of the 40th ACM Symposium on Theory of Computing, 2008.\n[Neu90] A. Neumaier. Interval Methods for Systems of Equations. Cambridge University Press, 1990.\n[Pin96] J.D. Pintér. Global Optimization in Action (Continuous and Lipschitz Optimization: Algorithms, Implementations and Applications). Kluwer Academic Publishers, 1996.\n[SKKS10] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In International Conference on Machine Learning, pages 1015-1022, 2010.\n[Sli11] A. Slivkins. Multi-armed bandits on implicit metric spaces. In Advances in Neural Information Processing Systems, 2011.\n[SS00] R.G. Strongin and Ya.D. Sergeyev. Global Optimization with Non-Convex Constraints: Sequential and Parallel Algorithms. Kluwer Academic Publishers, Dordrecht / Boston / London, 2000.\n", "award": [], "sourceid": 4304, "authors": [{"given_name": "R\u00e9mi", "family_name": "Munos", "institution": null}]}