{"title": "Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 1439, "page_last": 1447, "abstract": "We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that use gradient estimates based on random perturbations suffer a factor of at most $\\sqrt{\\dim}$ in convergence rate over traditional stochastic gradient methods, where $\\dim$ is the dimension of the problem. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problem-dependent quantities: they cannot be improved by more than constant factors.", "full_text": "Finite Sample Convergence Rates of Zero-Order\n\nStochastic Optimization Methods\n\nJohn C. Duchi1 Michael I. Jordan1,2 Martin J. Wainwright1,2\nAndre Wibisono1\n1Department of Electrical Engineering and Computer Science and 2Department of Statistics\n\nUniversity of California, Berkeley\n\nBerkeley, CA USA 94720\n\n{jduchi,jordan,wainwrig,wibisono}@eecs.berkeley.edu\n\nAbstract\n\nWe consider derivative-free algorithms for stochastic optimization problems that\nuse only noisy function values rather than gradients, analyzing their \ufb01nite-sample\nconvergence rates. We show that if pairs of function values are available, algo-\nrithms that use gradient estimates based on random perturbations suffer a factor\nof at most \u221ad in convergence rate over traditional stochastic gradient methods,\nwhere d is the problem dimension. 
We complement our algorithmic develop-\nment with information-theoretic lower bounds on the minimax convergence rate of\nsuch problems, which show that our bounds are sharp with respect to all problem-\ndependent quantities: they cannot be improved by more than constant factors.\n\n1\n\nIntroduction\n\nDerivative-free optimization schemes have a long history in optimization (see, for example, the\nbook by Spall [21]), and they have the clearly desirable property of never requiring explicit gradient\ncalculations. Classical techniques in stochastic and non-stochastic optimization, including Kiefer-\nWolfowitz-type procedures [e.g. 17], use function difference information to approximate gradients\nof the function to be minimized rather than calculating gradients. Researchers in machine learning\nand statistics have studied online convex optimization problems in the bandit setting, where a player\nand adversary compete, with the player choosing points \u03b8 in some domain \u0398 and an adversary\nchoosing a point x, forcing the player to suffer a loss F (\u03b8; x), where F (\u00b7; x) :\u0398 \u2192 R is a convex\nfunction [13, 5, 1]. The goal is to choose optimal \u03b8 based only on observations of function values\nF (\u03b8; x). 
Applications include online auctions and advertisement selection in search engine results. Additionally, the field of simulation-based optimization provides many examples of problems in which optimization is performed based only on function values [21, 10], and problems in which the objective is defined variationally (as the maximum of a family of functions), such as certain graphical model and structured-prediction problems, are also natural because explicit differentiation may be difficult [23].

Despite the long history and recent renewed interest in such procedures, an understanding of their finite-sample convergence rates remains elusive. In this paper, we study algorithms for solving stochastic convex optimization problems of the form

min_{θ∈Θ} f(θ) := E_P[F(θ; X)] = ∫_X F(θ; x) dP(x),    (1)

where Θ ⊆ R^d is a compact convex set, P is a distribution over the space X, and for P-almost every x ∈ X, the function F(·; x) is closed convex. Our focus is on the convergence rates of algorithms that observe only stochastic realizations of the function values f(θ).

Work on this problem includes Nemirovski and Yudin [18, Chapter 9.3], who develop a randomized sampling strategy that estimates ∇F(θ; x) using samples from the surface of the ℓ2-sphere, and Flaxman et al. [13], who build on this approach, applying it to bandit convex optimization problems. The convergence rates in these works are (retrospectively) sub-optimal [20, 2]: Agarwal et al.
[2] provide algorithms that achieve convergence rates (ignoring logarithmic factors) of O(poly(d)/√k), where poly(d) is a polynomial in the dimension d, for stochastic algorithms receiving only single function values, but (as the authors themselves note) the algorithms are quite complicated.

Some of the difficulties inherent in optimization using only a single function evaluation can be alleviated when the function F(·; x) can be evaluated at two points, as noted independently by Agarwal et al. [1] and Nesterov [20]. The insight is that for small u, the quantity (F(θ + uZ; x) − F(θ; x))/u approximates a directional derivative of F(θ; x) and can thus be used in first-order optimization schemes. Such two-sample-based gradient estimators allow simpler analyses, with sharper convergence rates [1, 20], than algorithms that have access to only a single function evaluation in each iteration. In the current paper, we take this line of work further, finding the optimal rate of convergence for procedures that are only able to obtain function evaluations, F(·; X), for samples X. Moreover, adopting the two-point perspective, we present simple randomization-based algorithms that achieve these optimal rates.

More formally, we study algorithms that receive paired observations Y(θ, τ) ∈ R², where θ and τ are points the algorithm selects, and the t-th sample is

Y^t(θ^t, τ^t) := [F(θ^t; X^t), F(τ^t; X^t)],    (2)

where X^t is a sample drawn from the distribution P. After k iterations, the algorithm returns a vector θ̂(k) ∈ Θ. In this setting, we analyze stochastic gradient and mirror-descent procedures [27, 18, 6, 19] that construct gradient estimators using the two-point observations Y^t.
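As a concrete illustration of the two-point construction above, consider the following minimal sketch. The quadratic objective, the Gaussian-then-normalize sampling of the direction, and all numerical constants are illustrative assumptions for this example, not the paper's experimental setup; the sketch only shows that averaging the estimator recovers the gradient when E[ZZ^⊤] = I.

```python
import numpy as np

def two_point_gradient(F, theta, u, rng, d):
    """Two-point estimate: Z uniform on the sphere of radius sqrt(d), so
    E[Z Z^T] = I, and g = (F(theta + u Z) - F(theta)) / u * Z."""
    z = rng.standard_normal(d)
    z *= np.sqrt(d) / np.linalg.norm(z)  # uniform on the sqrt(d)-sphere
    return (F(theta + u * z) - F(theta)) / u * z

# Illustrative smooth objective f(theta) = 0.5 ||theta||^2, gradient theta.
rng = np.random.default_rng(0)
d, u = 5, 1e-6
theta = np.ones(d)
# Averaging many estimates should approach grad f(theta) = theta.
est = np.mean([two_point_gradient(lambda t: 0.5 * t @ t, theta, u, rng, d)
               for _ in range(20000)], axis=0)
```

With the perturbation size u small, the only error is Monte Carlo noise; the empirical mean of the estimates is close to the true gradient.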
By a careful analysis of the dimension dependence of certain random perturbation schemes, we show that the convergence rate attained by our stochastic gradient methods is roughly a factor of √d worse than that attained by stochastic methods that observe the full gradient ∇F(θ; X). Under appropriate conditions, our convergence rates are a factor of √d better than those attained by Agarwal et al. [1] and Nesterov [20]. In addition, though we present our results in the framework of stochastic optimization, our analysis applies to (two-point) bandit online convex optimization problems [13, 5, 1], and we consequently obtain the sharpest rates for such problems. Finally, we show that the convergence rates we provide are tight—meaning sharp to within constant factors—by using information-theoretic techniques for constructing lower bounds on statistical estimators.

2 Algorithms

Stochastic mirror descent methods are a class of stochastic gradient methods for solving the problem min_{θ∈Θ} f(θ). They are based on a proximal function ψ, which is a differentiable convex function defined over Θ that is assumed (without loss of generality, by scaling) to be 1-strongly convex with respect to the norm ‖·‖ over Θ. The proximal function defines a Bregman divergence D_ψ : Θ × Θ → R_+ via

D_ψ(θ, τ) := ψ(θ) − ψ(τ) − ⟨∇ψ(τ), θ − τ⟩ ≥ (1/2)‖θ − τ‖²,    (3)

where the inequality follows from the strong convexity of ψ over Θ. The mirror descent (MD) method proceeds in a sequence of iterations that we index by t, updating the parameter vector θ^t ∈ Θ using stochastic gradient information to form θ^{t+1}.
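The Bregman divergence just defined can be checked numerically for two standard proximal functions. This is a small sketch under illustrative inputs: the Euclidean choice ψ = (1/2)‖·‖₂² gives exactly the squared Euclidean distance, while the entropic choice on the probability simplex gives the KL divergence, whose lower bound (1/2)‖θ − τ‖₁² (Pinsker's inequality) matches the strong-convexity inequality in (3) for the ℓ1-norm.

```python
import numpy as np

def bregman(psi, grad_psi, theta, tau):
    """D_psi(theta, tau) = psi(theta) - psi(tau) - <grad psi(tau), theta - tau>."""
    return psi(theta) - psi(tau) - grad_psi(tau) @ (theta - tau)

# Two illustrative points on the probability simplex.
theta, tau = np.array([0.2, 0.5, 0.3]), np.array([0.6, 0.1, 0.3])

# Euclidean proximal psi = 0.5||.||_2^2: D_psi is exactly 0.5||theta - tau||_2^2.
d_euc = bregman(lambda x: 0.5 * x @ x, lambda x: x, theta, tau)

# Entropic proximal psi = sum_j x_j log x_j: D_psi is the KL divergence,
# lower-bounded by 0.5||theta - tau||_1^2 (Pinsker), as in inequality (3).
d_ent = bregman(lambda x: np.sum(x * np.log(x)),
                lambda x: np.log(x) + 1.0, theta, tau)
```

The entropic case is the proximal function behind the exponentiated-gradient methods discussed in Section 2.2.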
At iteration t the MD method receives a (subgradient) vector g^t ∈ R^d, which it uses to update θ^t via

θ^{t+1} = argmin_{θ∈Θ} { ⟨g^t, θ⟩ + (1/α(t)) D_ψ(θ, θ^t) },    (4)

where {α(t)} is a non-increasing sequence of positive stepsizes.

We make two standard assumptions throughout the paper. Let θ* denote a minimizer of the problem (1). The first assumption [18, 6, 19] describes the properties of ψ and the domain.

Assumption A. The proximal function ψ is strongly convex with respect to the norm ‖·‖. The domain Θ is compact, and there exists R < ∞ such that D_ψ(θ*, θ) ≤ (1/2)R² for θ ∈ Θ.

Our second assumption is standard for almost all first-order stochastic gradient methods [19, 24, 20], and it holds whenever the functions F(·; x) are G-Lipschitz with respect to the norm ‖·‖. We use ‖·‖_* to denote the dual norm to ‖·‖, and let g : Θ × X → R^d denote a measurable subgradient selection for the functions F; that is, g(θ; x) ∈ ∂F(θ; x) with E[g(θ; X)] ∈ ∂f(θ).

Assumption B. There is a constant G < ∞ such that the (sub)gradient selection g satisfies E[‖g(θ; X)‖_*²] ≤ G² for θ ∈ Θ.

When Assumptions A and B hold, the convergence rate of stochastic mirror descent methods is well understood [6, 19, Section 2.3]. Indeed, let the variables X^t ∈ X be sampled i.i.d. according to P, set g^t = g(θ^t; X^t), and let θ^t be generated by the mirror descent iteration (4) with stepsize α(t) = α/√t.
Then one obtains

E[f(θ̂(k))] − f(θ*) ≤ R²/(2α√k) + αG²/√k.    (5)

For the remainder of this section, we explore the use of function difference information to obtain subgradient estimates that can be used in mirror descent methods to achieve statements similar to the convergence guarantee (5).

2.1 Two-point gradient estimates and general convergence rates

In this section, we show—under a reasonable additional assumption—how to use two samples of the random function values F(θ; X) to construct nearly unbiased estimators of the gradient ∇f(θ) of the expected function f. Our analytic techniques are somewhat different from the methods employed in past work [1, 20]; as a consequence, we are able to achieve optimal dimension dependence.

Our method is based on an estimator of ∇f(θ). Our algorithm uses a non-increasing sequence of positive smoothing parameters {u_t} and a distribution µ on R^d (which we specify) satisfying E_µ[ZZ^⊤] = I. Upon receiving the point X^t ∈ X, we sample an independent vector Z^t and set

g^t = [F(θ^t + u_t Z^t; X^t) − F(θ^t; X^t)] / u_t · Z^t.    (6)

We then apply the mirror descent update (4) to the quantity g^t.

The intuition for the estimator (6) of ∇f(θ) follows from an understanding of the directional derivatives of the random function realizations F(θ; X). The directional derivative f′(θ, z) of the function f at the point θ in the direction z is f′(θ, z) := lim_{u↓0} [f(θ + uz) − f(θ)]/u. The limit always exists when f is convex [15, Chapter VI], and if f is differentiable at θ, then f′(θ, z) = ⟨∇f(θ), z⟩. In addition, we have the following key insight (see also Nesterov [20, Eq. (32)]): whenever ∇f(θ) exists,

E[f′(θ, Z)Z] = E[⟨∇f(θ), Z⟩ Z] = E[ZZ^⊤ ∇f(θ)] = ∇f(θ)

if the random vector Z ∈ R^d has E[ZZ^⊤] = I. Intuitively, for u_t small enough in the construction (6), the vector g^t should be a nearly unbiased estimator of the gradient ∇f(θ).

To formalize our intuition, we make the following assumption.

Assumption C. There is a function L : X → R_+ such that for (P-almost every) x ∈ X, the function F(·; x) has L(x)-Lipschitz continuous gradient with respect to the norm ‖·‖, and the quantity L(P)² := E[L(X)²] < ∞.

With Assumption C, we can show that g^t is (nearly) an unbiased estimator of ∇f(θ^t). Furthermore, for appropriate random vectors Z, we can also show that g^t has small norm, which yields better convergence rates for mirror descent-type methods. (See the proof of Theorem 1.) In order to study the convergence of mirror descent methods using the estimator (6), we make the following additional assumption on the distribution µ.

Assumption D. Let Z be sampled according to the distribution µ, where E[ZZ^⊤] = I. The quantity M(µ)² := E[‖Z‖⁴ ‖Z‖_*²] < ∞, and there is a constant s(d) such that for any vector g ∈ R^d, E[‖⟨g, Z⟩ Z‖_*²] ≤ s(d) ‖g‖_*².

As the next theorem shows, Assumption D is somewhat innocuous, the constant M(µ) not even appearing in the final bound. The dimension- (and norm-) dependent term s(d), however, is important for our results. In Section 2.2 we give explicit constructions of random variables that satisfy Assumption D. For now, we present the following result.

Theorem 1. Let {u_t} ⊂ R_+ be a non-increasing sequence of positive numbers, and let θ^t be generated according to the mirror descent update (4) using the gradient estimator (6).
Under Assumptions A, B, C, and D, if we set the step and perturbation sizes

α(t) = α R / (2G√(s(d)) √t)  and  u_t = u · G√(s(d)) / (L(P) M(µ)) · 1/t,

then

E[f(θ̂(k)) − f(θ*)] ≤ 2 (RG√(s(d))/√k) max{α, α⁻¹} + αu² RG√(s(d))/k + u RG√(s(d)) (log k)/k,

where θ̂(k) = (1/k) Σ_{t=1}^k θ^t, and the expectation is taken with respect to the samples X and Z.

The proof of Theorem 1 requires some technical care—we never truly receive unbiased gradients—and it builds on convergence proofs developed in the analysis of online and stochastic convex optimization [27, 19, 1, 12, 20] to achieve bounds of the form (5). Though we defer the proof to Appendix A.1, at a very high level, the argument is as follows. By using Assumption C, we see that for small enough u_t, the gradient estimator g^t from (6) is close (in expectation with respect to X^t) to f′(θ^t, Z^t)Z^t, which is an unbiased estimate of ∇f(θ^t). Assumption C also allows us to bound the moments of the gradient estimator g^t. By carefully ensuring that the errors in g^t as an estimator of ∇f(θ^t) scale with u_t, we may give an analysis similar to that used to derive the bound (5) and so obtain Theorem 1.

Before continuing, we make a few remarks. First, the method is reasonably robust to the selection of the step-size multiplier α (as noted by Nemirovski et al. [19] for gradient-based MD methods). So long as α(t) ∝ 1/√t, mis-specifying the multiplier α results in a scaling at worst linear in max{α, α⁻¹}. Perhaps more interestingly, our setting of u_t was chosen mostly for convenience and elegance of the final bound. In a sense, we can simply take u to be extremely close to zero (in practice, we must avoid numerical precision issues, and the stochasticity in the method makes such choices somewhat unnecessary).
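The whole method can be sketched compactly in the Euclidean setting, where ψ = (1/2)‖·‖₂², the update (4) reduces to projected gradient descent, and s(d) = d for sphere sampling (the setting of Corollary 1 below). This is a minimal illustration, not the paper's implementation: the toy objective, the constant standing in for L(P)M(µ), and the problem sizes are all assumptions of this example.

```python
import numpy as np

def zero_order_sgd(F_sample, d, R, G, k, alpha=1.0, u=1.0, seed=0):
    """Two-point zero-order method, Euclidean case: psi = 0.5||.||_2^2, so
    update (4) is projected gradient descent, with Theorem 1's step and
    perturbation sizes and s(d) = d.  F_sample(rng) draws one random
    realization F(.; X), evaluable at two points."""
    rng = np.random.default_rng(seed)
    L_times_M = 1.0  # stands in for L(P) * M(mu); illustrative constant
    theta = np.zeros(d)
    avg = np.zeros(d)
    for t in range(1, k + 1):
        step = alpha * R / (2 * G * np.sqrt(d) * np.sqrt(t))
        u_t = u * G * np.sqrt(d) / (L_times_M * t)
        z = rng.standard_normal(d)
        z *= np.sqrt(d) / np.linalg.norm(z)            # Z uniform on sqrt(d)-sphere
        F = F_sample(rng)                              # one sample X -> F(.; X)
        g = (F(theta + u_t * z) - F(theta)) / u_t * z  # two-point estimate (6)
        theta = theta - step * g                       # mirror descent step (4)
        nrm = np.linalg.norm(theta)                    # project onto ||theta|| <= R
        if nrm > R:
            theta *= R / nrm
        avg += (theta - avg) / t                       # running average theta_hat
    return avg

# Toy problem: f(theta) = E[0.5 ||theta - X||^2] with X ~ N(theta_star, I),
# which is minimized at theta_star.
d = 4
theta_star = np.array([0.5, -0.3, 0.2, 0.1])
def F_sample(rng):
    x = theta_star + rng.standard_normal(d)
    return lambda th: 0.5 * np.sum((th - x) ** 2)

theta_hat = zero_order_sgd(F_sample, d, R=2.0, G=4.0, k=20000)
```

On this toy problem, the averaged iterate lands near the minimizer, consistent with the O(RG√d/√k) guarantee.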
In addition, the convergence rate of the method is independent of the Lipschitz continuity constant L(P) of the instantaneous gradients ∇F(·; X); the penalty for nearly non-smooth objective functions enters the bound only as a second-order term. This suggests similar results should hold for non-differentiable functions; we have been able to show that in some cases this is true, but a fully general result has proved elusive thus far. We are currently investigating strategies for the non-differentiable case.

Using similar arguments based on Azuma-Hoeffding-type inequalities, it is possible to give high-probability convergence guarantees [cf. 9, 19] under additional tail conditions on g, for example, that E[exp(‖g(θ; X)‖_*²/G²)] ≤ exp(1). Additionally, though we have presented our results as convergence guarantees for stochastic optimization problems, an inspection of our analysis in Appendix A.1 shows that we obtain (expected) regret bounds for bandit online convex optimization problems [e.g. 13, 5, 1].

2.2 Examples and corollaries

In this section, we provide examples of random sampling strategies that give direct convergence rate estimates for the mirror descent algorithm with subgradient samples (6). For each corollary, we specify the norm ‖·‖, proximal function ψ, and distribution µ, verify that Assumptions A, B, C, and D hold, and then apply Theorem 1 to obtain a convergence rate.

We begin with a corollary that describes the convergence rate of our algorithm when the expected function f is Lipschitz continuous with respect to the Euclidean norm ‖·‖₂.

Corollary 1. Given the proximal function ψ(θ) := (1/2)‖θ‖₂², suppose that E[‖g(θ; X)‖₂²] ≤ G² and that µ is uniform on the surface of the ℓ2-ball of radius √d. With the step size choices in Theorem 1, we have

E[f(θ̂(k)) − f(θ*)] ≤ 2 (RG√d/√k) max{α, α⁻¹} + αu² RG√d/k + u RG√d (log k)/k.

Proof  Note that ‖Z‖₂ = √d, which implies M(µ)² = E[‖Z‖₂⁶] = d³. Furthermore, it is easy to see that E[ZZ^⊤] = I. Thus, for g ∈ R^d we have

E[‖⟨g, Z⟩ Z‖₂²] = d E[⟨g, Z⟩²] = d E[g^⊤ ZZ^⊤ g] = d ‖g‖₂²,

which gives us s(d) = d.

The rate provided by Corollary 1 is the fastest derived to date for zero-order stochastic optimization using two function evaluations. Both Agarwal et al. [1] and Nesterov [20] achieve rates of convergence of order RGd/√k. Admittedly, neither requires that the random functions F(·; X) be continuously differentiable. Nonetheless, Assumption C does not require a uniform bound on the Lipschitz constant L(X) of the gradients ∇F(·; X); moreover, the convergence rate of the method is essentially independent of L(P).

In high-dimensional scenarios, appropriate choices for the proximal function ψ yield better scaling on the norm of the gradients [18, 14, 19, 12]. In online learning and stochastic optimization settings where one observes gradients g(θ; X), if the domain Θ is the simplex, then exponentiated gradient algorithms [16, 6] using the proximal function ψ(θ) = Σ_j θ_j log θ_j obtain rates of convergence dependent on the ℓ∞-norm of the gradients ‖g(θ; X)‖∞. This scaling is more palatable than dependence on Euclidean norms applied to the gradient vectors, which may be a factor of √d larger. Similar results apply [7, 6] when using proximal functions based on ℓp-norms. Indeed, making the choice p = 1 + 1/log d and ψ(θ) = (1/(2(p−1)))‖θ‖_p², we obtain the following corollary.

Corollary 2. Assume that E[‖g(θ; X)‖∞²] ≤ G² and that Θ ⊆ {θ ∈ R^d : ‖θ‖₁ ≤ R}. Set µ to be uniform on the surface of the ℓ2-ball of radius √d. Use the step sizes specified in Theorem 1. There are universal constants C₁ < 20e and C₂ < 10e such that

E[f(θ̂(k)) − f(θ*)] ≤ C₁ (RG√(d log d)/√k) max{α, α⁻¹} + C₂ (RG√(d log d)/k) (αu² + u log k).

Proof  The proof of this corollary is somewhat involved. The main argument involves showing that the constants M(µ) and s(d) may be taken as M(µ)² ≤ d⁶ and s(d) ≤ 24 d log d.

First, we recall [18, 7, Appendix 1] that our choice of ψ is strongly convex with respect to the norm ‖·‖_p. In addition, if we define q = 1 + log d, then we have 1/p + 1/q = 1, and ‖v‖_q ≤ e‖v‖∞ for any v ∈ R^d and any d. As a consequence, we see that we may take the norm ‖·‖ = ‖·‖₁ and the dual norm ‖·‖_* = ‖·‖∞, and E[‖⟨g, Z⟩ Z‖_q²] ≤ e² E[‖⟨g, Z⟩ Z‖∞²]. To apply Theorem 1 with appropriate values from Assumption D, we now bound E[‖⟨g, Z⟩ Z‖∞²]; see Appendix A.3 for a proof.

Lemma 3. Let Z be distributed uniformly on the ℓ2-sphere of radius √d.
For any g ∈ R^d,

E[‖⟨g, Z⟩ Z‖∞²] ≤ C · d log d · ‖g‖∞²,

where C ≤ 24 is a universal constant.

As a consequence of Lemma 3, the constant s(d) of Assumption D satisfies s(d) ≤ C d log d. Finally, we have the essentially trivial bound M(µ)² = E[‖Z‖₁⁴ ‖Z‖∞²] ≤ d⁶ (we only need the quantity M(µ) to be finite to apply Theorem 1). Recalling that the set Θ ⊆ {θ ∈ R^d : ‖θ‖₁ ≤ R}, our choice of ψ yields [e.g., 14, Lemma 3]

(p − 1) D_ψ(θ, τ) ≤ (1/2)‖θ‖_p² + (1/2)‖τ‖_p² + ‖θ‖_p ‖τ‖_p.

We thus find that D_ψ(θ, τ) ≤ 2R² log d for any θ, τ ∈ Θ, and using the step and perturbation size choices of Theorem 1 gives the result.

Corollary 2 attains a convergence rate that scales with dimension as √(d log d). This dependence on dimension is much worse than that of (stochastic) mirror descent using full gradient information [18, 19]. The additional dependence on d suggests that while O(1/ε²) iterations are required to achieve ε-optimization accuracy for mirror descent methods (ignoring logarithmic factors), the two-point method requires O(d/ε²) iterations to obtain the same accuracy. A similar statement holds for the results of Corollary 1. In the next section, we show that this dependence is sharp: except for logarithmic factors, no algorithm can attain better convergence rates, including the problem-dependent constants R and G.

3 Lower bounds on zero-order optimization

We turn to providing lower bounds on the rate of convergence for any method that receives random function values. For our lower bounds, we fix a norm ‖·‖ on R^d and as usual let ‖·‖_* denote its dual norm.
We assume that Θ = {θ ∈ R^d : ‖θ‖ ≤ R} is the norm ball of radius R. We study all optimization methods that receive function values of random convex functions, building on the analysis of stochastic gradient methods by Agarwal et al. [3].

More formally, let A_k denote the collection of all methods that observe a sequence of data points (Y^1, ..., Y^k) ⊂ R² with Y^t = [F(θ^t; X^t), F(τ^t; X^t)] and return an estimate θ̂(k) ∈ Θ. The classes of functions over which we prove our lower bounds consist of those satisfying Assumption B; that is, for a given Lipschitz constant G > 0, we consider optimization problems over the set F_G. The set F_G consists of pairs (F, P) as described in the objective (1), and for (F, P) ∈ F_G we assume there is a measurable subgradient selection g(θ; X) ∈ ∂F(θ; X) satisfying E_P[‖g(θ; X)‖_*²] ≤ G² for θ ∈ Θ.

Given an algorithm A ∈ A_k and a pair (F, P) ∈ F_G, we define the optimality gap

ε_k(A, F, P, Θ) := f(θ̂(k)) − inf_{θ∈Θ} f(θ) = E_P[F(θ̂(k); X)] − inf_{θ∈Θ} E_P[F(θ; X)],    (7)

where θ̂(k) is the output of A on the sequence of observed function values. The quantity (7) is a random variable, since the Y^t are random and A may use additional randomness. We are thus interested in its expected value, and we define the minimax error

ε_k^*(F_G, Θ) := inf_{A∈A_k} sup_{(F,P)∈F_G} E[ε_k(A, F, P, Θ)],    (8)

where the expectation is over the observations (Y^1, . . .
, Y^k) and randomness in A.

3.1 Lower bounds and optimality

In this section, we give lower bounds on the minimax rate of optimization for a few specific settings. We present our main results, then recall Corollaries 1 and 2, which imply that we have attained the minimax rates of convergence for zero-order (stochastic) optimization schemes. The following sections contain proof sketches; we defer technical arguments to the appendices.

We begin by providing minimax lower bounds when the expected function f(θ) = E[F(θ; X)] is Lipschitz continuous with respect to the ℓ2-norm. We have the following proposition.

Proposition 1. Let Θ = {θ ∈ R^d : ‖θ‖₂ ≤ R} and let F_G consist of pairs (F, P) for which the subgradient mapping g satisfies E_P[‖g(θ; X)‖₂²] ≤ G² for θ ∈ Θ. There exists a universal constant c > 0 such that for k ≥ d,

ε_k^*(F_G, Θ) ≥ c GR√d/√k.

Combining the lower bounds provided by Proposition 1 with our algorithmic scheme in Section 2 shows that our analysis is essentially sharp, since Corollary 1 provides an upper bound for the minimax optimality of RG√d/√k. The stochastic gradient descent algorithm (4) coupled with the sampling strategy (6) is thus optimal for stochastic problems with two-point feedback.

Now we investigate the minimax rates at which it is possible to solve stochastic convex optimization problems whose objectives are Lipschitz continuous with respect to the ℓ1-norm. As noted earlier, such scenarios are suitable for high-dimensional problems [e.g. 19].

Proposition 2. Let Θ = {θ ∈ R^d : ‖θ‖₁ ≤ R} and let F_G consist of pairs (F, P) for which the subgradient mapping g satisfies E_P[‖g(θ; X)‖∞²] ≤ G² for θ ∈ Θ.
There exists a universal constant c > 0 such that for k ≥ d,

ε_k^*(F_G, Θ) ≥ c GR√d/√k.

We may again consider the optimality of our mirror descent algorithms, recalling Corollary 2. In this case, the MD algorithm (4) with the choice ψ(θ) = (1/(2(p−1)))‖θ‖_p², where p = 1 + 1/log d, implies that there exist universal constants c and C such that

c GR√d/√k ≤ ε_k^*(F_G, Θ) ≤ C GR√(d log d)/√k

for the problem class described in Proposition 2. Here the upper bound is again attained by our two-point mirror-descent procedure. Thus, to within logarithmic factors, our mirror-descent based algorithm is optimal for these zero-order optimization problems.

When full gradient information is available, that is, when one has access to the subgradient selection g(θ; X), the √d factors appearing in the lower bounds in Propositions 1 and 2 are not present [3]. The √d factors similarly disappear from the convergence rates in Corollaries 1 and 2 when one uses g^t = g(θ^t; X^t) in the mirror descent updates (4); said differently, one may take the constant s(d) = 1 in Theorem 1 [19, 6]. As noted in Section 2, our lower bounds consequently show that, in addition to the dependence on the radius R and second moment G² present when gradients are available [3], all algorithms must suffer an additional O(√d) penalty in convergence rate. This suggests that for high-dimensional problems, it is preferable to use full gradient information if possible, even when the cost of obtaining the gradients is somewhat high.

3.2 Proofs of lower bounds

We sketch proofs for our lower bounds on the minimax error (8), which are based on the framework introduced by Agarwal et al. [3].
The strategy is to reduce the optimization problem to a testing problem: we choose a finite set of (well-separated) functions, show that optimizing well implies that one can identify the function being optimized, and then, as in statistical minimax theory [26, 25], apply information-theoretic lower bounds on the probability of error in hypothesis testing problems.

We begin with a finite set V ⊆ R^d, to be chosen depending on the characteristics of the function class F_G, and a collection of functions and distributions G = {(F_v, P_v) : v ∈ V} ⊆ F_G indexed by V. Define f_v(θ) = E_{P_v}[F_v(θ; X)], and let θ_v^* ∈ argmin_{θ∈Θ} f_v(θ). We also let δ > 0 be an accuracy parameter upon which P_v and the following quantities implicitly depend. Following Agarwal et al. [3], we define the separation between two functions as

ρ(f_v, f_w) := inf_{θ∈Θ} [f_v(θ) + f_w(θ)] − f_v(θ_v^*) − f_w(θ_w^*),

and the minimal separation of the set V (this may depend on the accuracy parameter δ) as

ρ^*(V) := min{ρ(f_v, f_w) : v, w ∈ V, v ≠ w}.

For any algorithm A ∈ A_k, there exists a hypothesis test v̂ : (Y^1, ..., Y^k) → V such that for V sampled uniformly from V (see [3, Lemma 2]),

P(v̂(Y^1, ..., Y^k) ≠ V) ≤ (2/ρ^*(V)) E[ε_k(A, F_V, P_V, Θ)] ≤ (2/ρ^*(V)) max_{v∈V} E[ε_k(A, F_v, P_v, Θ)],    (9)

where the expectation is taken over the observations (Y^1, ..., Y^k). By Fano's inequality [11],

P(v̂ ≠ V) ≥ 1 − (I(Y^1, ..., Y^k; V) + log 2) / log |V|.    (10)

We thus must upper bound the mutual information I(Y^1, ..., Y^k; V), which leads us to the following lemma. (See Appendix B.3 for its somewhat technical proof.)

Lemma 4. Let X | V = v be distributed as N(δv, σ²I), and let F(θ; x) = ⟨θ, x⟩. Let V be a uniform random variable on V ⊆ R^d, and assume that Cov(V) ⪯ λI for some λ ≥ 0. Then

I(Y^1, Y^2, ..., Y^k; V) ≤ λkδ²/σ².

Using Lemma 4, we can obtain a lower bound on the minimax optimization error whenever the instantaneous objective functions are of the form F(θ; x) = ⟨θ, x⟩. Combining inequalities (9), (10), and Lemma 4, we find that if we choose the accuracy parameter

δ = (σ/√(kλ)) · ((log |V|)/2 − log 2)^{1/2},    (11)

(we assume that |V| > 4), then there exists a pair (F, P) ∈ F_G such that

E[ε_k(A, F, P, Θ)] ≥ ρ^*(V)/4.    (12)

The inequality (12) can give concrete lower bounds on the minimax optimization error. In our lower bounds, we use F_v(θ; x) = ⟨θ, x⟩ and set P_v to be the N(δv, σ²I) distribution, which allows us to apply Lemma 4. Proving Propositions 1 and 2 thus requires three steps:

1. Choose the set V with the property that Cov(V) ⪯ λI when V ∼ Uniform(V).
2. Choose the variance parameter σ² such that for each v ∈ V, the pair (F_v, P_v) ∈ F_G.
3. Calculate the separation value ρ^*(V) as a function of the accuracy parameter δ.

Enforcing (F_v, P_v) ∈ F_G amounts to choosing σ² so that E[‖X‖_*²] ≤ G² for X ∼ N(δv, σ²I). By construction f_v(θ) = δ⟨θ, v⟩, which allows us to give lower bounds on the minimal separation ρ^*(V) for the choices of the norm constraint ‖θ‖ ≤ R in Propositions 1 and 2.
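The three-step recipe above is easy to illustrate numerically. As a small sketch under illustrative parameter values (σ, k, and d here are arbitrary choices, not values from the paper), consider the coordinate packing V = {±e_1, ..., ±e_d} used for Proposition 2: a uniform draw from this set has mean zero and covariance exactly I/d, so step 1 holds with λ = 1/d, and the accuracy parameter δ of (11) can then be computed directly.

```python
import numpy as np

# Step 1 for the packing V = {±e_1, ..., ±e_d}: a uniform V on this set has
# E[V] = 0 and Cov(V) = I/d, so Cov(V) <= lambda I holds with lambda = 1/d.
d = 6
V = np.vstack([np.eye(d), -np.eye(d)])   # the 2d packing points, one per row
cov = V.T @ V / len(V)                   # covariance of uniform V (mean zero)

# The accuracy parameter delta from (11), with illustrative sigma and k;
# note |V| = 2d = 12 > 4, so the expression under the root is positive.
sigma, k, lam = 1.0, 1000, 1.0 / d
delta = sigma / np.sqrt(k * lam) * np.sqrt(np.log(len(V)) / 2 - np.log(2))
```

Larger k forces a smaller δ, which in turn shrinks the separation ρ^*(V) and yields the 1/√k decay in the lower bounds.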
We defer formal proofs to Appendices B.1 and B.2, providing sketches here.

For Proposition 1, an argument using the probabilistic method implies that there are universal constants c₁, c₂ > 0 for which there is a 1/2-packing V of the ℓ₂-sphere of radius 1 with size at least |V| ≥ exp(c₁d) and such that (1/|V|) Σ_{v∈V} vv^⊤ ⪯ c₂ I_{d×d}/d. By the linearity of f_v, we find ρ(f_v, f_w) ≥ δR/16, and setting σ² = G²/(2d) and δ as in the choice (11) implies that E[‖X‖²₂] ≤ G². Substituting δ and ρ*(V) into the bound (12) proves Proposition 1.

For Proposition 2, we use the packing set V = {±e_i : i = 1, . . . , d}. Standard bounds [8] on the normal distribution imply that for Z ∼ N(0, I), we have E[‖Z‖²_∞] = O(log d). Thus we find that for σ² = O(G²/log(d)) and suitably small δ, we have E[‖X‖²_∞] = O(G²); linearity yields ρ(f_v, f_w) ≥ δR for v ≠ w ∈ V. Setting δ as in the expression (11) yields Proposition 2.

4 Discussion

We have analyzed algorithms for stochastic optimization problems that use only random function values—as opposed to gradient computations—to minimize an objective function. As our development of minimax lower bounds shows, the algorithms we present, which build on those proposed by Agarwal et al. [1] and Nesterov [20], are optimal: their convergence rates cannot be improved (in a minimax sense) by more than numerical constant factors. As a consequence of our results, we have attained sharp rates for bandit online convex optimization problems with multi-point feedback. We have also shown that there is a necessary sharp transition in convergence rates between stochastic gradient algorithms and algorithms that compute only function values.
This result highlights the advantages of using gradient information when it is available, but we recall that there are many applications in which gradients are not available.

Finally, one question that this work leaves open, and which we are actively attempting to address, is whether our convergence rates extend to non-smooth optimization problems. We conjecture that they do, though it will be interesting to understand the differences between smooth and non-smooth problems when only zeroth-order feedback is available.

Acknowledgments  This material was supported in part by ONR MURI grant N00014-11-1-0688 and the U.S. Army Research Laboratory and the U.S. Army Research Office under grant no. W911NF-11-1-0391. JCD was also supported by an NDSEG fellowship and a Facebook PhD fellowship.

References

[1] A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Proceedings of the Twenty Third Annual Conference on Computational Learning Theory, 2010.

[2] A. Agarwal, D. P. Foster, D. Hsu, S. M. Kakade, and A. Rakhlin. Stochastic convex optimization with bandit feedback. SIAM Journal on Optimization, to appear, 2011. URL http://arxiv.org/abs/1107.1744.

[3] A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249, May 2012.

[4] K. Ball. An elementary introduction to modern convex geometry. In S. Levy, editor, Flavors of Geometry, pages 1–58. MSRI Publications, 1997.

[5] P. L. Bartlett, V. Dani, T. P. Hayes, S. M. Kakade, A. Rakhlin, and A. Tewari. High-probability regret bounds for bandit online linear optimization. In Proceedings of the Twenty First Annual Conference on Computational Learning Theory, 2008.

[6] A. Beck and M. Teboulle.
Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167–175, 2003.

[7] A. Ben-Tal, T. Margalit, and A. Nemirovski. The ordered subsets mirror descent optimization method with applications to tomography. SIAM Journal on Optimization, 12:79–108, 2001.

[8] V. Buldygin and Y. Kozachenko. Metric Characterization of Random Variables and Random Processes, volume 188 of Translations of Mathematical Monographs. American Mathematical Society, 2000.

[9] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. In Advances in Neural Information Processing Systems 14, pages 359–366, 2002.

[10] A. Conn, K. Scheinberg, and L. Vicente. Introduction to Derivative-Free Optimization, volume 8 of MPS-SIAM Series on Optimization. SIAM, 2009.

[11] T. M. Cover and J. A. Thomas. Elements of Information Theory, second edition. Wiley, 2006.

[12] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In Proceedings of the Twenty Third Annual Conference on Computational Learning Theory, 2010.

[13] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2005.

[14] C. Gentile. The robustness of the p-norm algorithms. Machine Learning, 53(3), 2002.

[15] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I & II. Springer, 1996.

[16] J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–64, Jan. 1997.

[17] H. J. Kushner and G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer, second edition, 2003.

[18] A. Nemirovski and D. Yudin.
Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.

[19] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[20] Y. Nesterov. Random gradient-free minimization of convex functions. URL http://www.ecore.be/DPs/dp_1297333890.pdf, 2011.

[21] J. C. Spall. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. Wiley, 2003.

[22] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing: Theory and Applications, chapter 5, pages 210–268. Cambridge University Press, 2012.

[23] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

[24] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.

[25] Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence. Annals of Statistics, 27(5):1564–1599, 1999.

[26] B. Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435. Springer-Verlag, 1997.

[27] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.