{"title": "Efficiently avoiding saddle points with zero order methods: No gradients required", "book": "Advances in Neural Information Processing Systems", "page_first": 10066, "page_last": 10077, "abstract": "We consider the case of derivative-free algorithms for non-convex optimization, also known as zero order algorithms, that use only function evaluations rather than gradients. For a wide variety of gradient approximators based on finite differences, we establish asymptotic convergence to second order stationary points using a carefully tailored application of the Stable Manifold Theorem. Regarding efficiency, we introduce a noisy zero-order method that converges to second order stationary points, i.e., avoids saddle points. Our algorithm uses only $\\tilde{\\mathcal{O}}(1 / \\epsilon^2)$ approximate gradient calculations and, thus, it matches the convergence rate guarantees of its exact gradient counterparts up to constants. In contrast to previous work, our convergence rate analysis avoids imposing additional dimension dependent slowdowns in the number of iterations required for non-convex zero order optimization.", "full_text": "Efficiently avoiding saddle points\n\nwith zero order methods: No gradients required\n\nLampros Flokas\u2217\n\nDepartment of Computer Science\n\nColumbia University\nNew York, NY 10025\n\nEmmanouil V. Vlatakis-Gkaragkounis\u2217\n\nDepartment of Computer Science\n\nColumbia University\nNew York, NY 10025\n\nlamflokas@cs.columbia.edu\n\nemvlatakis@cs.columbia.edu\n\nGeorgios Piliouras\n\nEngineering Systems and Design\n\nSingapore University of Technology and Design\n\nSingapore\n\ngeorgios@sutd.edu.sg\n\nAbstract\n\nWe consider the case of derivative-free algorithms for non-convex optimization,\nalso known as zero order algorithms, that use only function evaluations rather than\ngradients. 
For a wide variety of gradient approximators based on finite differences,\nwe establish asymptotic convergence to second order stationary points using a\ncarefully tailored application of the Stable Manifold Theorem. Regarding efficiency,\nwe introduce a noisy zero-order method that converges to second order stationary\npoints, i.e., avoids saddle points. Our algorithm uses only $\\tilde{\\mathcal{O}}(1/\\epsilon^2)$ approximate\ngradient calculations and, thus, it matches the convergence rate guarantees of its exact\ngradient counterparts up to constants. In contrast to previous work, our convergence\nrate analysis avoids imposing additional dimension dependent slowdowns in the\nnumber of iterations required for non-convex zero order optimization.\n\n1 Introduction\n\nGiven a function f : Rd \u2192 R, solving the problem\n\n$$x^* = \\arg\\min_{x \\in \\mathbb{R}^d} f(x)$$\n\nis one of the building blocks that many machine learning algorithms are based on. The difficulty\nof this problem varies significantly depending on the properties of f and the way we can access\ninformation about it. The general case of non-convex functions makes the problem significantly more\nchallenging, since first order stationary points can be global or local optima as well as saddle points.\nIn fact, discovering global optima is an NP hard problem in general and even for quartic functions\nverifying local optima is a co-NP complete problem [MK87, LPP+19].\nWhile local optima may be satisfactory for some applications in machine learning [CHM+15],\nsaddle points can make high dimensional non convex optimization tasks significantly more difficult\n[DPG+14, SQW18]. Therefore, researchers have focused their efforts on functions possessing\nthe strict saddle property. Under this property, Hessians of f evaluated at saddle points have at\nleast one negative eigenvalue, making detection of saddle points tractable. 
Given this assumption,\nmethods that use second order information like computing Hessians or Hessian-vector products\n[NP06, CD16, AAB+17] can converge to second order stationary points (SOSPs) and thus avoid\nstrict saddle points. Recent work [GHJY15, Lev16, JGN+17, LPP+19, AL18, JNJ18] has also\nshown that gradient descent (and its variants) can avoid strict saddle points and converge to\nlocal minima.\n\n\u2217Equal contribution\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nUnfortunately, access to gradient evaluations is not available in all settings of interest. Even with\nthe advent of automatic differentiation software, there are several applications where computation\nof gradients is either computationally inefficient or even impossible. Examples of such applications\nare hyper-parameter tuning of machine learning models [SLA12, SHCS17, CRS+18], black-box\nadversarial attacks on deep neural networks [PMG+17, MMS+18, CZS+17], computer network\ncontrol [LCCH18], variational approaches to graphical models [WJ08] and simulation based [RK16,\nSpa03] or bandit feedback optimization [ADX10, CG19]. Zero order methods, also known as black-box\nmethods, try to address these issues by employing only evaluations of the function f during the\noptimization procedure. The case of convex functions is well understood [NS17, DJWW15, ADX10].\nFor the non-convex case, there has been a considerable amount of work on the convergence to first\norder stationary points both for deterministic settings [NS17] and stochastic ones [GL13, WDBS18,\nBG18, LKC+18, GHH16].\nThe case of SOSPs has been so far comparatively under-studied. 
It has been established that SOSPs\nare achievable through zero order trust region methods that employ fully quadratic models [CSV09].\nThe disadvantage of trust region methods is that their computation cost per iteration is $\\mathcal{O}(d^4)$, which\nquickly becomes prohibitive as we increase the number of dimensions d. More recently, the authors\nof [JLGJ18] studied the case of finding local minima of functions having access only to approximate\nfunction or gradient evaluations. They manage to reduce zero order optimization to the stochastic\nfirst order optimization of a Gaussian smoothed version of f. While this approach yields guarantees\nof convergence to SOSPs, each stochastic gradient evaluation requires $\\mathcal{O}(\\mathrm{poly}(d, 1/\\epsilon))$\nfunction evaluations. This leads to significantly less efficient optimization algorithms when\ncompared to their first order counterparts. It is therefore yet unclear if there are scalable zero\norder methods that can safely avoid strict saddle points and always converge to local minima\nof f. To the best of our knowledge, our work is the first one to establish a positive answer to\nthis important question.\nOur results. We prove that zero order optimization methods solve general non-convex problems\nefficiently. In a nutshell, we present a family of zero order optimization methods which provably\nconverge to SOSPs. Our proof includes a new, elaborate analysis of the Stable Manifold Theorem\n(see Section 4). Additionally, the number of approximate gradient evaluations matches the standard\nbounds for first order methods in non-convex problems (see Table 1 & Section 5).\n\nAlgorithm | Oracle | Iterations | Evaluations of f\nTheorem 3 | Approximate Gradient | Asymptotic | Asymptotic\n[LPP+19] | Exact Gradient | Asymptotic | -\nTheorem 4 | Approx. Gradient + Noise | $\\tilde{\\mathcal{O}}(1/\\epsilon^2)$ | $\\tilde{\\mathcal{O}}(d/\\epsilon^2)$\nFPSGD [JLGJ18] | Approx. Gradient + Noise | $\\tilde{\\mathcal{O}}(d/\\epsilon^2)$ | $\\tilde{\\mathcal{O}}(d^4/\\epsilon^4)$\nZPSGD [JLGJ18] | Function Evaluations + Noise | $\\tilde{\\mathcal{O}}(1/\\epsilon^2)$ | $\\tilde{\\mathcal{O}}(d^2/\\epsilon^5)$\n[JGN+17] | Exact Gradient + Noise | $\\tilde{\\mathcal{O}}(1/\\epsilon^2)$ | -\n\nTable 1: Oracle model and iteration complexity to SOSPs.\n\nAlgorithms. Instead of focusing on a single finite differences algorithm, we construct a general\nframework of approximate gradient oracles that generalizes over many finite differences approaches\nin the literature. We then use these approximate gradient oracles to devise approximate gradient\ndescent algorithms. For more details see Section 3.3 and Definition 4.\nAsymptotic convergence. We use the stable manifold theorem to prove that zero order methods can\nalmost surely avoid saddle points. In contrast to the analysis of [LPP+19] for first order methods, the\nzero order case is more demanding. Convergence to first order stationary points requires changing\nthe gradient approximation accuracy over the iterations and, thus, the equivalent dynamical system is\ntime dependent. By reducing our time dependent dynamical system to a time invariant one defined in\nan expanded state, we are able to obtain provable guarantees about avoiding saddle points. To extend\nour guarantees of convergence to deterministic choices of the initial accuracy, we provide a carefully\ntailored application of the Stable Manifold Theorem that analyzes the structure of the stable manifolds\nof the dynamical system. Our results on saddle point avoidance extend to functions with non isolated\ncritical points. To address this, we provide sufficient conditions for point-wise convergence of the\niterates of approximate gradient descent methods for the case of analytic functions.\nConvergence rates for noisy dynamics. 
In order to produce fast convergence rates, as in the case\nof first order methods [JGN+17], it is useful to consider perturbed/noisy versions of the dynamics.\nOnce again the case of zero order methods poses distinct hurdles. Close to critical points of f,\napproximations of the potentially arbitrarily small gradient can be very noisy. Iterates of exact\ngradient descent and approximate gradient descent may diverge significantly in this case. In fact,\nprovably escaping saddle points by guaranteeing a decrease in the value of f is more challenging for the\ncase of approximate gradient descent since it is not a descent algorithm. A key technical step is to\nshow that the negative curvature dynamics that enable gradient descent to escape saddle points are\nrobust to gradient approximation errors. As long as the gradient approximation error is smaller than\na fixed a-priori known threshold, zero order methods can provably escape saddle points. Based on\nthis, we are able to prove that zero order methods can converge to approximate SOSPs with the same\nnumber of approximate gradient evaluations as the bound provided by [JGN+17], up to constants.\nIt is worth pointing out that achieving an $\\tilde{\\mathcal{O}}(\\epsilon^{-2})$ bound of approximate gradient evaluations requires\nconceptually different techniques from other recent approaches in zero order methods. Indeed,\nprevious work on randomized and stochastic zero order optimization [NS17, GL13] has relied on\ntreating randomized approximate gradients of f as exact gradients, in expectation, of a carefully\nconstructed smoothed version of f. Then with some additional work, convergence arguments for\nthe smooth version of f can be transferred to f itself. Although these arguments are applicable to\nour case as well, as shown by the work of [JLGJ18], they also lead to a slowdown both in terms of\nthe dimension d and the required accuracy \u03f5. 
The main reasons behind this slowdown are the\ndependence on d of the Lipschitz constants of the smoothed version of f and the high variance of the stochastic\ngradient estimators. To sidestep both issues, we analyze the effect of gradient approximation error\ndirectly on the optimization of f.\n\n2 Related Work\n\nOur work builds and improves upon previous finite difference approaches for non-convex optimization\nand provides SOSP guarantees previously reserved for computationally expensive methods.\nFirst Order Algorithms. A recent line of work has shown that gradient descent and variations of it\ncan actually converge to SOSPs. Specifically, [LPP+19] shows that gradient descent starting from a\nrandom point can eventually converge to SOSPs with probability one. [JGN+17, JNJ18] modified\nstandard gradient descent using perturbations to provide an algorithm that converges to SOSPs in\n$\\mathcal{O}(\\mathrm{poly}(\\log d, 1/\\epsilon))$ iterations. As noted in the introduction, the zero order case poses additional\nhurdles compared to the first order one. Our work, by addressing these hurdles, effectively extends\nthe guarantees provided by [LPP+19, JGN+17] to zero order methods.\nZero Order Algorithms. Approximating gradients using finite differences methods has been the\nstandard approach for both convex and non-convex zero order optimization. [NS17] established\nconvergence properties even for randomized gradient oracles. Recently, [DJWW15] provided optimal\nguarantees for stochastic convex optimization up to logarithmic factors. For the more general case\nof stochastic non-convex optimization there has been extensive work covering several aspects of\nthe problem: distributed [HZ18], asynchronous [LZH+16], high-dimensional [WDBS18, BG18]\noptimization and variance reduction [LKC+18, GHH16]. 
It is worth mentioning that the aforementioned\nwork is focused on convergence to \u03f5-first order stationary points.\nRegarding SOSPs, [CSV09] showed that trust region methods that employ fully quadratic models can\nconverge to SOSPs at the cost of $\\mathcal{O}(d^4)$ operations per iteration. The authors of [JLGJ18] studied the\nconvergence to SOSPs using approximate function or gradient evaluations. While both approaches\nare applicable for the zero order setting with exact function evaluations, as we will see in Section\n3.4, this type of reduction results in algorithms that require substantially more function evaluations\nto reach an \u03f5-SOSP. Our work provides provable guarantees of convergence at significantly faster\nrates.\n\n3 Preliminaries\n\n3.1 Notation\nWe will use lower case bold letters x, y to denote vectors. \u2016\u00b7\u2016 will be used to denote the spectral\nnorm and the \u21132 vector norm. \u03bbmin(\u00b7) will be used to denote the minimum eigenvalue of a matrix. If\ng is a vector valued differentiable function then Dg denotes the differential of function g. We will\nuse {e1, e2, . . . , ed} to refer to the standard orthonormal basis of Rd. Also, C^n is the set of n times\ncontinuously differentiable functions. Bx(r) refers to the ball of radius r centered at x. Finally, \u00b5(S)\nis the Lebesgue measure of a measurable set S \u2286 Rd.\n\n3.2 Definitions\nA function f : Rd \u2192 R is said to be L-continuous, \u2113-gradient, \u03c1-Hessian Lipschitz if for every\nx, y \u2208 Rd: \u2016f (x) \u2212 f (y)\u2016 \u2264 L\u2016x \u2212 y\u2016, \u2016\u2207f (x) \u2212 \u2207f (y)\u2016 \u2264 \u2113\u2016x \u2212 y\u2016, \u2016\u2207\u00b2f (x) \u2212 \u2207\u00b2f (y)\u2016 \u2264\n\u03c1\u2016x \u2212 y\u2016, respectively. Additionally, we can define approximate first order stationary points as:\nDefinition 1 (\u03f5-first order stationary point). 
Let f : Rd \u2192 R be a differentiable function. Then\nx \u2208 Rd is an \u03f5-first order stationary point of f if \u2016\u2207f (x)\u2016 \u2264 \u03f5.\nA first order stationary point can be either a local minimum, a local maximum or a saddle point.\nFollowing the terminology of [LPP+19] and [JGN+17], we will include local maxima in saddle\npoints since they are both undesirable for our minimization task. Under this definition, strict saddle\npoints can be identified as follows:\nDefinition 2 (Strict saddle point). Let f : Rd \u2192 R be a twice differentiable function. Then x \u2208 Rd\nis a strict saddle point of f if \u2016\u2207f (x)\u2016 = 0 and \u03bbmin(\u2207\u00b2f (x)) < 0.\nTo avoid convergence to strict saddle points, we need to converge to SOSPs. In order to study\nthe convergence rate of algorithms that converge to SOSPs, we need to define some notion of\napproximate SOSPs. Following the convention of [JGN+17] we define the following:\nDefinition 3 (\u03f5-SOSP). Let f : Rd \u2192 R be a \u03c1-Hessian Lipschitz function. Then x \u2208 Rd is an\n\u03f5-second order stationary point of f if \u2016\u2207f (x)\u2016 \u2264 \u03f5 and $\\lambda_{\\min}(\\nabla^2 f(x)) \\geq -\\sqrt{\\rho\\epsilon}$.\n\n3.3 Gradient Approximation using Zero Order Information\n\nOne of the key ingredients that enables zero order methods to converge quickly is the use of approximations of\nthe gradient based on finite differences approaches. Here we will show how forward differencing\ncan provide these approximate gradient calculations. Without much additional effort we can get the\nsame results for other finite differences approaches like backward and symmetric difference as well\nas finite differences approaches with higher order accuracy guarantees. 
Let us define the gradient\napproximation function based on forward difference rf : Rd \u00d7 R \u2192 Rd\n\n$$r_f(x, h) = \\begin{cases} \\sum_{l=1}^{d} \\frac{f(x + h e_l) - f(x)}{h} e_l & \\text{when } h \\neq 0 \\\\ \\nabla f(x) & \\text{if } h = 0 \\end{cases} \\qquad (1)$$\n\nThis function takes two arguments: a vector x where the gradient should be approximated as well as\na scalar value h that controls the approximation accuracy of the estimator. An additional property\nthat will be of interest when we analyze approximate gradient descent is the fact that rf is Lipschitz.\nBased on the definition one can show:\nLemma 1. Let f be \u2113-gradient Lipschitz. Then rf (\u00b7, h) as defined in Equation 1 is $\\sqrt{d}\\,\\ell$ Lipschitz for\nall h \u2208 R and \u2200h \u2208 R, x \u2208 Rd : $\\|r_f(x, h) - \\nabla f(x)\\| \\leq \\ell\\sqrt{d}\\,|h|$.\n\n3.4 Black box reductions to first order methods\n\nAs shown in the works of [NS17, GL13], zero order optimization is reducible to stochastic first\norder optimization. The reduction relies on treating randomized approximate gradients of f as exact\ngradients, in expectation, of a carefully constructed smoothed version of f. These arguments are\napplicable to our case as well. FPSG, one of the approaches of [JLGJ18], naively leads to a large\npoly(d) dependence in the convergence rate. More specifically one can show that [JLGJ18]\u2019s FPSG\nmethod needs $\\tilde{\\mathcal{O}}(d^3/\\epsilon^4)$ evaluations of \u2207g to converge to an \u03f5-SOSP. The main reason behind this\ndimension dependent slowdown is that the Hessian Lipschitz constant of the smoothed version of\ng is $\\mathcal{O}(\\rho\\sqrt{d})$. An alternative approach in [JLGJ18] named ZPSG builds gradient estimators using\nfunction evaluations directly. The main source of slowdown here is the high variance of the stochastic\ngradients. 
An analysis of those methods for the case where exact function evaluations are available\ncan be found in the Appendix.\nIn the next sections we will provide an alternative analysis that accounts for the gradient approximation\nerrors on the optimization of f directly. Thus, we will be able to sidestep the above issues and provide\nfaster convergence rates and better sample complexity.\n\n4 Approximate Gradient Descent\n\n4.1 Description\n\nIt is easy to see that conceptually any iterative optimization method can be expressed as a dynamical\nsystem of the form {xk+1 = g(xk)} where xk is the current solution iterate that gets updated through\nan update function g. Additionally, for first order methods strict saddle points correspond to the\nunstable fixed points of the dynamical system. These key observations have motivated [LPP+19]\nto use the Stable Manifold Theorem (SMT) [Shu87] in order to prove that gradient descent avoids\nstrict saddle points. Intuitively, SMT formalizes why convergence to unstable fixed points is unlikely\nstarting from a local region around an unstable fixed point. Adding the requirement that g is a global\ndiffeomorphism, [LPP+19] generalizes the conclusions of SMT to the whole space.\nIn order to prove similar guarantees for a zero order algorithm using approximate gradient evaluations,\nwe will need to construct a new dynamical system that is applicable to our zero order setting. The\nstate of our dynamical system \u03c7k consists of two parts: the current solution iterate xk that is a vector\nin Rd and a scalar value hk \u2208 R that controls the quality of the gradient approximation. Specifically\nwe have\n\n$$\\chi_{k+1} = g_0(\\chi_k) \\triangleq \\begin{pmatrix} x_{k+1} \\\\ h_{k+1} \\end{pmatrix} = \\begin{pmatrix} x_k - \\eta q_x(x_k, h_k) \\\\ \\beta q_h(h_k) \\end{pmatrix} \\qquad (2)$$\n\nwhere \u03b7, \u03b2 \u2208 R+ are positive scalar parameters and qx : Rd \u00d7 R \u2192 Rd and qh : R \u2192 R are functions.\nThe function qx can be seen as the gradient approximation oracle used by the dynamical system as\ndescribed in Section 3.3. The function qh is responsible for controlling the accuracy of the gradient\napproximation. As we shall see later, it is important that hk converges to 0 so that the stable points of\ng0 are the same as in gradient descent.\n\n4.2 Avoiding Strict Saddle points\n\nIn this section we will provide sufficient conditions that the parameters \u03b7, \u03b2 must satisfy so that the\nupdate rule of Equation 2 avoids convergence to strict saddle points. To do this we will need to\nintroduce some properties of g0.\nDefinition 4 ((L, B, c)-Well-behaved function). Let f : Rd \u2192 R \u2208 C^2 be an \u2113-gradient Lipschitz\nfunction. A function g0 of the form of Equation 2 is a (L, B, c)-well behaved function (for function f)\nif it has the following properties: i) qx, qh \u2208 C^1 with qh(0) = 0. ii) \u2200h \u2208 R : qx(\u00b7, h) is L Lipschitz\nand $0 < \\frac{\\partial q_h(h)}{\\partial h} \\leq B$. iii) \u2200(x, h) \u2208 Rd+1 : \u2016qx(x, h) \u2212 \u2207f (x)\u2016 \u2264 c|h|.\n\nGiven this definition and Lemma 1, it is clear that we can always construct (L, B, c)-well-behaved\nfunctions for $L = \\sqrt{d}\\,\\ell$, B = 1, $c = \\sqrt{d}\\,\\ell$ using qx = rf and qh = h.\nIn the following lemmas and theorems we will require that \u03b2B < 1. Under this assumption \u03b2qh is a\ncontraction having 0 as its only fixed point so for all fixed points of g0 we know that h = 0. Notice\nalso that when h = 0, we have qx(x, 0) = \u2207f (x) and therefore the x coordinates of fixed points of\ng0 must coincide with first order stationary points of f. In fact, in the Appendix we prove that there\nis a one to one mapping between strict saddles of f and unstable fixed points of g0. Using the same\nassumptions, we also get that det(Dg0(\u00b7)) \u2260 0. 
Putting all together, we are able to prove our first\nmain result.\nTheorem 1. Let g0 be a (L, B, c)-well-behaved function for function f. Let $X^*_f$ be the set of strict\nsaddle points of f. Then if $\\eta < \\frac{1}{L}$ and $\\beta < \\frac{1}{B}$: $\\forall h_0 \\in \\mathbb{R} : \\mu(\\{x_0 : \\lim_{k \\to \\infty} x_k \\in X^*_f\\}) = 0$.\n\nNotice that the random initialization refers only to the x0\u2019s domain. Indeed a straightforward\napplication of the result of [LPP+19] would guarantee a saddle-avoidance lemma only under an extra\nrandom choice of h0. Such a result would not be able to clarify if saddle-avoidance stems from the\ninstability of the fixed point, just like in first order methods, or from the additional randomness of h0.\nThe key insight provided by the SMT is that all the initialization points that eventually converge\nto an unstable fixed point lie in a low dimensional manifold. Thus, to obtain a stronger result we\nhave to understand how SMT restricts the dimensionality of this stable manifold for a fixed h0. The\nstructure of the eigenvectors of the Jacobian of g0 around a fixed point reveals that such an interesting\ndecoupling is indeed achievable.\n\n4.3 Convergence\n\nIn the previous section we provided sufficient conditions to avoid convergence to strict saddle points.\nThese results are meaningful however only if limk\u2192\u221e xk exists. Therefore, in this section we will\nprovide sufficient conditions such that the dynamical system of g0 converges. Given that strict saddle\npoints are avoided, it is sufficient to prove convergence to first order stationary points. Let the error\nof the gradient approximation be \u03b5k = qx(xk, hk) \u2212 \u2207f (xk). Firstly we establish the zero order\nanalogue of the folklore lower bound for the decrease of the function:\nLemma 2 (Step-Convergence). Suppose that g0 is a (L, B, c)-well-behaved function for an \u2113-gradient\nLipschitz function f. If $\\eta \\leq \\frac{1}{\\ell}$ then we have that $f(x_{k+1}) \\leq f(x_k) - \\frac{\\eta}{2}\\left(\\|\\nabla f(x_k)\\|^2 - \\|\\varepsilon_k\\|^2\\right)$.\nGiven this lemma we can prove convergence to first order stationary points.\nTheorem 2 (Convergence to first order stationary points). Suppose that g0 is a (L, B, c)-well-behaved\nfunction for an \u2113-gradient Lipschitz function f. Let $\\eta \\leq \\frac{1}{\\ell}$, $\\beta < \\frac{1}{B}$. Then if f is lower\nbounded, $\\lim_{k \\to \\infty} \\|\\nabla f(x_k)\\| = 0$.\nThe last theorem gives us a guarantee that the norm of the gradient is converging to zero but this\nis not enough to prove convergence to a single stationary point if f has non isolated critical points.\nIn the Appendix, we prove that if the gradient approximation error decreases quickly enough then\nconvergence to a single stationary point is guaranteed for analytic functions. This allows us to\nconclude our analysis with this final theorem.\nTheorem 3 (Convergence to minimizers). Let f : Rd \u2192 R \u2208 C^2 be an \u2113-gradient Lipschitz function.\nLet us also assume that f is analytic, has compact sub-level sets and all of its saddle points are strict.\nLet g0 be a (L, B, c)-well-behaved function for f with $\\eta < \\min\\{\\frac{1}{L}, \\frac{1}{2\\ell}\\}$ and $\\beta < \\frac{1 - 2\\eta\\ell}{B}$. If we pick a\nrandom initialization point x0, then we have that for the xk iterates of g0\n\n$$\\forall h_0 \\in \\mathbb{R} : \\Pr(\\lim_{k \\to \\infty} x_k = x^*) = 1$$\n\nwhere x\u2217 is a local minimizer of f.\n\n5 Escaping Saddle Points Efficiently\n\n5.1 Overview\n\nIn the previous subsections we provided sufficient conditions for approximate gradient descent to\navoid strict saddle points. However, the stable manifold theorem guarantees only that this will happen\nasymptotically. In fact, convergence could be quite slow until we reach a neighborhood of a local\nminimum. 
An analysis done for the first order case by [DJL+17] showed that avoiding saddle points\ncould take exponential time in the worst case. In this section, we will use ideas from the work of\n[JGN+17] in order to get a zero order algorithm that converges to SOSPs efficiently.\nConvergence to SOSPs poses unique challenges to zero order methods when it comes to controlling\nthe gradient approximation accuracy. For convergence to first order stationary points one can use\nproperty iii) of Definition 4 and Lemma 2 to show that h = \u03f5/c guarantees the decrease of f until\n\u2016\u2207f (xk)\u2016 \u2264 \u03f5. For SOSPs, this is not applicable as the norm of the gradient can become arbitrarily\nsmall near saddle points. One could resort to iteratively trying smaller h to find one that guarantees\nthe decrease of f. A surprising fact about our algorithm is that even if the gradient is arbitrarily small,\ncomputationally burdensome searches for h can be totally avoided.\n\n5.2 Algorithm\n\nAlgorithm Initialization: (\u2113, \u03c1, \u03f5, c, \u03b4, \u2206f )\n1: $\\chi \\leftarrow 3\\max\\{\\log(\\frac{d\\ell\\Delta_f}{c\\epsilon^2\\delta}), 4\\}$, $\\eta \\leftarrow \\frac{c}{\\ell}$, $r \\leftarrow \\frac{\\sqrt{c}}{\\chi^2} \\cdot \\frac{\\epsilon}{\\ell}$, $g_{thres} \\leftarrow \\frac{\\sqrt{c}}{\\chi^2} \\cdot \\epsilon$, $f_{thres} \\leftarrow \\frac{c}{\\chi^3} \\cdot \\sqrt{\\frac{\\epsilon^3}{\\rho}}$\n2: $t_{thres} \\leftarrow \\frac{\\chi}{c^2} \\cdot \\frac{\\ell}{\\sqrt{\\rho\\epsilon}}$, $S \\leftarrow \\frac{\\sqrt{c}}{\\chi} \\cdot \\sqrt{\\frac{\\epsilon}{\\rho}}$, $h_{low} \\leftarrow \\frac{1}{c_h} \\min\\{g_{thres}, \\frac{r\\rho\\delta S}{2\\sqrt{d}}\\}$\n\nAlgorithm 1 PAGD(x0)\n1: for t = 0, 1, . . . do\n2:   zt \u2190 q(xt, gthres/(4ch))\n3:   if \u2016zt\u2016 \u2265 (3/4) gthres then\n4:     xt+1 \u2190 xt \u2212 \u03b7zt\n5:   else\n6:     xt+1 \u2190 EscapeSaddle(xt)\n7:     if xt+1 = xt then return xt\n8:   end if\n9: end for\n\nAlgorithm 2 EscapeSaddle(x\u0302)\n1: \u03be \u223c Unif(B0(r))\n2: x\u03030 \u2190 x\u0302 + \u03be\n3: for i = 0, 1, . . . , tthres do\n4:   if f (x\u0302) \u2212 f (x\u0303i) \u2265 fthres then\n5:     return x\u0303i\n6:   end if\n7:   x\u0303i+1 \u2190 x\u0303i \u2212 \u03b7q(x\u0303i, hlow)\n8: end for\n9: return x\u0302\n\nJust like [JGN+17], we will assume that f is \u2113-gradient Lipschitz and also \u03c1-Hessian Lipschitz. To\nconstruct a zero order algorithm we will also need a gradient approximator q : Rd \u00d7 R \u2192 Rd. We\nwill only require the error bound property on q, i.e., there exists a constant ch such that\n\n\u2200x \u2208 Rd, h \u2208 R : \u2016q(x, h) \u2212 \u2207f (x)\u2016 \u2264 ch|h|\n\nThe high level idea of Algorithm 1 is that given a point xt that is not an \u03f5-SOSP the algorithm makes\nprogress by finding an xt+1 where f (xt+1) is substantially smaller than f (xt). By the definition of\n\u03f5-SOSPs either the gradient of f at xt is large or the Hessian has a substantially negative eigenvalue.\nSeparating these two cases is not as straightforward as in the first order case. Given the norm of the\napproximate gradient q(x, h), we only know that \u2016\u2207f (x)\u2016 \u2208 \u2016q(x, h)\u2016 \u00b1 ch|h|. In Algorithm 1\nby choosing 3gthres/4 as the threshold to test for and h = gthres/(4ch), we guarantee that in step 4\n\u2016\u2207f (xt)\u2016 \u2265 gthres/2. This threshold is actually high enough to guarantee substantial decrease of f.\nIndeed given that we have a lower bound on the exact gradient and using Lemma 2 we get\n\n$$f(x_t) - f(x_{t+1}) \\geq \\frac{\\eta}{2}\\left(\\|\\nabla f(x_t)\\|^2 - \\|\\varepsilon_t\\|^2\\right) \\geq \\frac{3}{32}\\eta g_{thres}^2$$\n\nwhere \u03b5t is the gradient approximation error at xt. This decrease is the same as in the first order case\nup to constants.\nOn the other hand, in Algorithm 2 we are guaranteed that \u2016\u2207f (x\u0302)\u2016 \u2264 gthres. 
In this case our\napproximate gradient cannot guarantee a substantial decrease of f. However, we know that the\nHessian has a substantially negative eigenvalue and therefore a direction of steep decrease of f\nmust exist. The problem is that we do not know which direction has this property. In [JGN+17]\nit is proved that identifying this direction is not necessary for the first order case. Adding a small\nrandom perturbation to our current iterate (step 2) is enough so that with high probability we can get\na substantial decrease of f after at most tthres gradient descent steps (step 5). Of course this work is\nnot directly applicable to our case since we do not have access to exact gradients.\nThe work of [JGN+17] mainly depends on two arguments to provide its guarantees. The first\nargument is that if the x\u0303i iterates do not achieve a decrease of fthres in tthres steps then they must\nremain confined in a small ball around x\u03030. Specifically for the exact gradient case we have that\n\n$$\\|\\tilde{x}_i - \\tilde{x}_0\\|^2 \\leq 2\\eta f_{thres} t_{thres}.$$\n\nThe zero order case is definitely more challenging since each update in Algorithm 2 is not guaranteed\nto decrease the value of f. Therefore, iterates may wander away from x\u03030 without even decreasing the\nfunction value of f. To amend this argument for the zero order case we require that hlow \u2264 gthres/ch.\nThis guarantees that even if gradient approximation errors amass over the iterations we will get the\nsame bound as the first order case up to constants.\nThe second argument of [JGN+17] formalizes why the existence of a negative eigenvalue of the\nHessian is important. Let us run gradient descent starting from two points u0 and w0 such that\nw0 \u2212 u0 = \u03bae where e is the eigenvector corresponding to the most negative eigenvalue of the\nHessian and $\\kappa \\geq r\\delta/(2\\sqrt{d})$. 
Then at least one of the sequences {wi}, {ui} is able to escape away\nfrom its starting point in tthres iterations and by the first argument it is also able to decrease the value of\nf substantially. The proof of the claim is based on creating a recurrence relationship on vi = wi \u2212 ui.\nThe corresponding recurrence relationship for the zero order case is more complicated with additional\nterms that correspond to the gradient approximation errors for wi and ui. However, we are able to\nprove that if hlow \u2264 r\u03c1\u03b4S/(2\u221ad) then these additional terms cannot distort the exponential growth of\nvi. Having extended both arguments of [JGN+17] we can establish the same guarantees for escaping\nsaddle points.\nTheorem 4 (Analysis of PAGD). There exists an absolute constant cmax such that: if f is \u2113-gradient\nLipschitz and \u03c1-Hessian Lipschitz, then for any \u03b4 > 0, \u03f5 \u2264 \u2113^2/\u03c1, \u2206f \u2265 f(x0) \u2212 f\u22c6, and constant\nc \u2264 cmax, with probability 1 \u2212 \u03b4, the output of PAGD(x0, \u2113, \u03c1, \u03f5, c, \u03b4, \u2206f) will be an \u03f5-SOSP, and\nPAGD will have the following number of iterations until termination:\n\n$$O\\left(\\frac{\\ell (f(x_0) - f^\\star)}{\\epsilon^2} \\log^4\\left(\\frac{d \\ell \\Delta_f}{\\epsilon^2 \\delta}\\right)\\right)$$\n\n6 Experiments\n\nIn this section we use simulations to verify our theoretical findings. Specifically, we are interested in\nverifying whether zero order methods can avoid saddle points as efficiently as first order methods. To do this\nwe use the two dimensional Rastrigin function, a popular benchmark in the non-convex optimization\nliterature. This function exhibits several strict saddle points so it will be an adequate benchmark for\nour case. 
The two dimensional Rastrigin function can be defined as\n\n$$\\mathrm{Ras}(x_1, x_2) = 20 + x_1^2 - 10\\cos(2\\pi x_1) + x_2^2 - 10\\cos(2\\pi x_2).$$\n\nFor this experiment we selected 75 points randomly from [\u22121.5, 1.5] \u00d7 [\u22121.5, 1.5]. In this domain\nthe Rastrigin function is \u2113-gradient Lipschitz with \u2113 \u2248 63.33. Using these points as initialization\nwe ran gradient descent and the approximate gradient descent dynamical system we introduced in\nSection 4.2. For both gradient descent and approximate gradient descent we used \u03b7 = 1/(4\u2113). Then\nfor approximate gradient descent we used symmetric differences to approximate the gradients and\n\u03b2 = 0.95 as well as h0 = 0.15. Figure 1 shows the contour plot of the Rastrigin function as well\nas the evolution of the iterates of both methods. As expected, for points initialized close to local\nminima of the function convergence is quite fast. On the other hand, points starting close to saddle\npoints of the Rastrigin function take some more time to converge to minima. However, it is clear that\nin both cases the behaviour of gradient descent and approximate gradient descent is similar in the\nsense that for the same initialization there is no discrepancy in terms of convergence speed for the\ntwo methods.\n\nFigure 1: Contour plots of the Rastrigin function along with the evolution of the iterates of gradient\ndescent and approximate gradient descent. Green points correspond to gradient descent whereas cyan\npoints correspond to approximate gradient descent.\n\nWe also want to experimentally verify the performance of PAGD. To do this we use the octopus\nfunction proposed by [DJL+17]. 
This function is particularly relevant to our setting as it possesses\na sequence of saddle points. The authors of [DJL+17] proved that for this function gradient descent\nneeds exponential time to avoid saddle points before converging to a local minimum. In contrast the\nperturbed version of gradient descent (PGD) of [JGN+17] does not suffer from the same limitation.\nBased on the results of Theorem 4, we expect PAGD not to have this limitation either. We compare\ngradient descent (GD), PGD, AGD and PAGD on an octopus function of d = 15 dimensions. Figure 2\nclearly shows that the zero order versions have the same iteration performance as the first-order\nones. In fact, AGD is shown to behave even better than GD in this example thanks to the noise\ninduced by the gradient approximation.\n\n[Figure 1 panels: Initial Points, Iteration 2, Iteration 4, Iteration 6]\n\nFigure 2: Octopus function value versus the number of iterations. Parameters of the function \u03c4 = e,\nL = e, \u03b3 = 1. Parameters of first order methods taken from [DJL+17]. Zero order methods use\nsymmetric differencing with h = 0.01.\n\n7 Conclusion\n\nThis paper is the first one to establish that zero order methods can avoid saddle points efficiently. To\nachieve this we went beyond smoothing arguments used in prior work and studied the effect of the\ngradient approximation error on first order methods that converge to second order stationary points.\nOne important open question for future work is whether similar guarantees can be established for\nother zero order methods used in practice like direct search methods and trust region methods using\nlinear models. 
Another generalization of interest would be to consider the performance of zero order\nmethods for instances of (non-convex) constrained optimization.\n\nAcknowledgements\n\nGeorgios Piliouras acknowledges MOE AcRF Tier 2 Grant 2016-T2-1-170, grant PIE-SGP-AI-2018-01\nand NRF 2018 Fellowship NRF-NRFF2018-07. Emmanouil-Vasileios Vlatakis-Gkaragkounis\nwas supported by NSF CCF-1563155, NSF CCF-1814873, NSF CCF-1703925, NSF CCF-1763970.\nWe are grateful to Alexandros Potamianos for bringing this problem to our attention, and for helpful\ndiscussions at an early stage of this project for its connection to Natural Language Processing tasks.\nFinally, this work was supported by the Onassis Foundation - Scholarship ID: F ZN 010-1/2017-2018.\n\n[Figure 2 plot: x-axis Iterations, y-axis f(xk); curves GD, PGD, AGD, PAGD]\n\nReferences\n[AAB+17] Naman Agarwal, Zeyuan Allen Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma.\nFinding approximate local minima faster than gradient descent. In Proceedings of the\n49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal,\nQC, Canada, June 19-23, 2017, pages 1195\u20131199, 2017.\n\n[ADX10] Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex\noptimization with multi-point bandit feedback. In COLT 2010 - The 23rd Conference\non Learning Theory, Haifa, Israel, June 27-29, 2010, pages 28\u201340, 2010.\n\n[AL18] Zeyuan Allen-Zhu and Yuanzhi Li. NEON2: finding local minima via first-order\noracles. In Advances in Neural Information Processing Systems 31: Annual Conference\non Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018,\nMontr\u00e9al, Canada., pages 3720\u20133730, 2018.\n\n[BG18] Krishnakumar Balasubramanian and Saeed Ghadimi. Zeroth-order (non)-convex stochastic\noptimization via conditional gradient and gradient updates. 
In Advances in Neural\nInformation Processing Systems 31: Annual Conference on Neural Information Pro-\ncessing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montr\u00e9al, Canada., pages\n3459\u20133468, 2018.\n\n[CD16] Yair Carmon and John C. Duchi. Gradient descent ef\ufb01ciently \ufb01nds the cubic-regularized\n\nnon-convex newton step. CoRR, abs/1612.00547, 2016.\n\n[CG19] Tianyi Chen and Georgios B. Giannakis. Bandit convex optimization for scalable and\n\ndynamic iot management. IEEE Internet of Things Journal, 6(1):1276\u20131286, 2019.\n\n[CHM+15] Anna Choromanska, Mikael Henaff, Micha\u00ebl Mathieu, G\u00e9rard Ben Arous, and Yann\nLeCun. The loss surfaces of multilayer networks. In Proceedings of the Eighteenth\nInternational Conference on Arti\ufb01cial Intelligence and Statistics, AISTATS 2015, San\nDiego, California, USA, May 9-12, 2015, 2015.\n\n[CRS+18] Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard E. Turner, and\nAdrian Weller. Structured evolution with compact architectures for scalable policy\noptimization. In Proceedings of the 35th International Conference on Machine Learning,\nICML 2018, Stockholmsm\u00e4ssan, Stockholm, Sweden, July 10-15, 2018, pages 969\u2013977,\n2018.\n\n[CSV09] Andrew R. Conn, Katya Scheinberg, and Lu\u00eds N. Vicente. Global convergence of general\nderivative-free trust-region algorithms to \ufb01rst- and second-order critical points. SIAM\nJournal on Optimization, 20(1):387\u2013415, 2009.\n\n[CZS+17] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: zeroth\norder optimization based black-box attacks to deep neural networks without training\nsubstitute models. In Proceedings of the 10th ACM Workshop on Arti\ufb01cial Intelligence\nand Security, AISec@CCS 2017, Dallas, TX, USA, November 3, 2017, pages 15\u201326,\n2017.\n\n[DJL+17] Simon S. Du, Chi Jin, Jason D. Lee, Michael I. 
Jordan, Aarti Singh, and Barnab\u00e1s P\u00f3czos.\nGradient descent can take exponential time to escape saddle points. In Advances in\nNeural Information Processing Systems 30: Annual Conference on Neural Information\nProcessing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 1067\u20131077,\n2017.\n\n[DJWW15] John C. Duchi, Michael I. Jordan, Martin J. Wainwright, and Andre Wibisono. Optimal\nrates for zero-order convex optimization: The power of two function evaluations. IEEE\nTrans. Information Theory, 61(5):2788\u20132806, 2015.\n\n[DPG+14] Yann N. Dauphin, Razvan Pascanu, \u00c7aglar G\u00fcl\u00e7ehre, KyungHyun Cho, Surya Ganguli,\nand Yoshua Bengio. Identifying and attacking the saddle point problem in high-\ndimensional non-convex optimization. In Advances in Neural Information Processing\nSystems 27: Annual Conference on Neural Information Processing Systems 2014, December\n8-13 2014, Montreal, Quebec, Canada, pages 2933\u20132941, 2014.\n\n[GHH16] Bin Gu, Zhouyuan Huo, and Heng Huang. Zeroth-order asynchronous doubly stochastic\nalgorithm with variance reduction. arXiv preprint arXiv:1612.01425, 2016.\n\n[GHJY15] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points - online\nstochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on\nLearning Theory, COLT 2015, Paris, France, July 3-6, 2015, pages 797\u2013842, 2015.\n\n[GL13] Saeed Ghadimi and Guanghui Lan. Stochastic \ufb01rst- and zeroth-order methods for\nnonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341\u20132368,\n2013.\n\n[HZ18] Davood Hajinezhad and Michael M. Zavlanos. Gradient-free multi-agent nonconvex\nnonsmooth optimization. In 57th IEEE Conference on Decision and Control, CDC 2018,\nMiami, FL, USA, December 17-19, 2018, pages 4939\u20134944, 2018.\n\n[JGN+17] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How\nto escape saddle points ef\ufb01ciently. 
In Proceedings of the 34th International Conference\non Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages\n1724\u20131732, 2017.\n\n[JLGJ18] Chi Jin, Lydia T. Liu, Rong Ge, and Michael I. Jordan. On the local minima of the\nempirical risk. In Advances in Neural Information Processing Systems 31: Annual Con-\nference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December\n2018, Montr\u00e9al, Canada., pages 4901\u20134910, 2018.\n\n[JNJ18] Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Accelerated gradient descent escapes\nsaddle points faster than gradient descent. In S\u00e9bastien Bubeck, Vianney Perchet, and\nPhilippe Rigollet, editors, Proceedings of the 31st Conference On Learning Theory,\nvolume 75 of Proceedings of Machine Learning Research, pages 1042\u20131085. PMLR,\n06\u201309 Jul 2018.\n\n[LCCH18] Sijia Liu, Jie Chen, Pin-Yu Chen, and Alfred Hero. Zeroth-order online alternating\ndirection method of multipliers: Convergence analysis and applications. In International\nConference on Arti\ufb01cial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018,\nPlaya Blanca, Lanzarote, Canary Islands, Spain, pages 288\u2013297, 2018.\n\n[Lev16] K\ufb01r Y. Levy. The power of normalization: Faster evasion of saddle points. CoRR,\n\nabs/1611.04831, 2016.\n\n[LKC+18] Sijia Liu, Bhavya Kailkhura, Pin-Yu Chen, Pai-Shun Ting, Shiyu Chang, and Lisa Amini.\nZeroth-order stochastic variance reduction for nonconvex optimization. In Advances in\nNeural Information Processing Systems 31: Annual Conference on Neural Information\nProcessing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montr\u00e9al, Canada., pages\n3731\u20133741, 2018.\n\n[LPP+19] Jason D. Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan,\nand Benjamin Recht. First-order methods almost always avoid strict saddle points. 
Math.\nProgram., 176(1-2):311\u2013337, 2019.\n\n[LZH+16] Xiangru Lian, Huan Zhang, Cho-Jui Hsieh, Yijun Huang, and Ji Liu. A comprehensive\nlinear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order\nto \ufb01rst-order. In Advances in Neural Information Processing Systems 29: Annual\nConference on Neural Information Processing Systems 2016, December 5-10, 2016,\nBarcelona, Spain, pages 3054\u20133062, 2016.\n\n[MK87] Katta G. Murty and Santosh N. Kabadi. Some np-complete problems in quadratic and\nnonlinear programming. Math. Program., 39(2):117\u2013129, 1987.\n\n[MMS+18] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian\nVladu. Towards deep learning models resistant to adversarial attacks. In 6th International\nConference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30\n- May 3, 2018, Conference Track Proceedings, 2018.\n\n[NP06] Yurii Nesterov and Boris T. Polyak. Cubic regularization of newton method and its\nglobal performance. Math. Program., 108(1):177\u2013205, 2006.\n\n[NS17] Yurii Nesterov and Vladimir G. Spokoiny. Random gradient-free minimization of convex\nfunctions. Foundations of Computational Mathematics, 17(2):527\u2013566, 2017.\n\n[PMG+17] Nicolas Papernot, Patrick D. McDaniel, Ian J. Goodfellow, Somesh Jha, Z. Berkay\nCelik, and Ananthram Swami. Practical black-box attacks against machine learning. In\nProceedings of the 2017 ACM on Asia Conference on Computer and Communications\nSecurity, AsiaCCS 2017, Abu Dhabi, United Arab Emirates, April 2-6, 2017, pages\n506\u2013519, 2017.\n\n[RK16] Reuven Y Rubinstein and Dirk P Kroese. Simulation and the Monte Carlo method,\nvolume 10. John Wiley & Sons, 2016.\n\n[SHCS17] Tim Salimans, Jonathan Ho, Xi Chen, and Ilya Sutskever. Evolution strategies as a\nscalable alternative to reinforcement learning. CoRR, abs/1703.03864, 2017.\n\n[Shu87] Michael Shub. Global stability of dynamical systems. 
Springer Science & Business\nMedia, 1987.\n\n[SLA12] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical bayesian optimization of\nmachine learning algorithms. In Advances in Neural Information Processing Systems 25:\n26th Annual Conference on Neural Information Processing Systems 2012. Proceedings\nof a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States., pages\n2960\u20132968, 2012.\n\n[Spa03] James C. Spall. Introduction to Stochastic Search and Optimization. John Wiley & Sons,\nInc., New York, NY, USA, 1 edition, 2003.\n\n[SQW18] Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. Foundations\nof Computational Mathematics, 18(5):1131\u20131198, 2018.\n\n[WDBS18] Yining Wang, Simon S. Du, Sivaraman Balakrishnan, and Aarti Singh. Stochastic\nzeroth-order optimization in high dimensions. In International Conference on Arti\ufb01cial\nIntelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote,\nCanary Islands, Spain, pages 1356\u20131365, 2018.\n\n[WJ08] Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families,\nand variational inference. Foundations and Trends in Machine Learning, 1(1-2):1\u2013305,\n2008.", "award": [], "sourceid": 5317, "authors": [{"given_name": "Emmanouil-Vasileios", "family_name": "Vlatakis-Gkaragkounis", "institution": "Columbia University"}, {"given_name": "Lampros", "family_name": "Flokas", "institution": "Columbia University"}, {"given_name": "Georgios", "family_name": "Piliouras", "institution": "Singapore University of Technology and Design"}]}