NeurIPS 2020

### Review 1

Strengths: This paper provides improved bounds for zeroth-order stochastic gradient descent in the case where the function is \alpha-strongly convex and \beta-Hölder smooth. While the algorithm is based on existing literature (with derivative-free stochastic optimization dating back to Nemirovski), the authors provide novel bounds with explicit dependence on the relevant parameters d, \alpha, and \beta, unlike prior work, which only shows explicit dependence on the number of iterations. In particular, the bounds they show include:
- an upper bound on the optimization complexity;
- an upper bound on the regret;
- an upper bound on the estimation error of the minimal value;
- a minimax lower bound on the optimization complexity for the case of independent noise (which matches their provided upper bound up to a factor of d).
Moreover, the ability to obtain bounds on estimating the minimizer of the function at a rate of 1/\sqrt{T}, which is the same statistical complexity one would obtain from just querying the minimizer under independent noise, is both surprising and interesting. I believe these contributions are of interest to the NeurIPS community, namely due to the importance of understanding zeroth-order optimization in various machine learning tasks. This work serves as a starting point for understanding the impact of higher-order smoothness conditions on zeroth-order optimization.

Weaknesses: The authors provide no discussion of broader impacts. Clearly zeroth-order optimization is very important for various stochastic bandit tasks, as many problems (e.g. optimizing complex chemical systems) have function values that are more readily computable than their gradient information. However, the authors provide no motivation, background, or interesting problems to help put their work in perspective. In particular, the additional assumption of strong convexity (which allows them to obtain guarantees that scale linearly, rather than exponentially, with the dimension) is a big additional assumption in comparison to the bandit literature, and there is no discussion of motivating problems in which situations like this might arise. In addition, the related work section mostly concerns the optimization perspective and ignores a large body of related work on exploiting smoothness in the stochastic contextual bandit literature. More discussion on this point is in the related work section of the review.

Correctness: The technical contributions of the paper appear correct, subject to clarity of presentation (see question 5/8 below). However, several of the proofs are very technical, and adding a brief summary or discussion of each result in the appendix before stating it would help with clarity. Moreover, there are instances in the proofs where theorems from prior work are used with no explicit discussion of what the result is or where and how it fits into the proof; this made some of the proofs difficult to follow. Lastly, in several cases the authors state that the proofs extend to the case where the function f_t is allowed to depend on the iteration number rather than being a fixed unknown function. Adding more discussion of this in the appendix, and potentially stating the more general bound under this assumption, would help clear those points up.

Relation to Prior Work: The authors provide a very detailed related work section summarizing the differences between their approach and existing zeroth-order optimization approaches. In particular, most of the prior work considers settings where the noise is deterministic or Gaussian, whereas they consider a more general setting by allowing adversarial noise. Moreover, they provide explicit dependence of the bounds on the strong convexity parameter \alpha and the dimension d, whereas prior work only showed bounds with an unspecified constant proportional to the number of iterations. However, as the contributions are technical, highlighting the key proof differences would help clarify their result against prior work. On the other hand, exploiting smoothness in the bandit literature (without the convexity assumption) has been studied before, e.g.:
- "Smooth Contextual Bandits: Bridging the Parametric and Non-differentiable Regret Regimes". Yichun Hu, Nathan Kallus, Xiaojie Mao, 2019.
- "Nonparametric bandits with covariates". Philippe Rigollet and Assaf Zeevi, 2010.
- "The Multi-Armed Bandit Problem with Covariates". Vianney Perchet and Philippe Rigollet, 2013.
Adding more discussion of the related work in stochastic bandits would help motivate and position this work in the broader context of stochastic multi-armed bandits (especially by differentiating it via the additional strong convexity assumption, which yields bounds that are not exponential in the dimension).

Reproducibility: Yes

Additional Feedback: Please note that the following comments refer to the main paper from the main submission and the appendix from the supplementary material.
- The discussion in line 180 is an interesting point: the algorithm is able to obtain a matching minimax 1/\sqrt{T} estimate of the minimal value under additive noise, even in the case when the true minimizer of the function is unknown.
- The discussion before Theorem 5.1 helped make the result more readable and interpretable; adding a brief discussion akin to this before each result would help with the clarity of presentation.
- In line 95 you state various assumptions on the kernels that are used. Can you give examples of the corresponding \kappa_\beta parameters for the kernels you state satisfy the assumptions (e.g. the weighted Legendre polynomials)?
- Each of the algorithms requires tuning parameters under the assumption that \alpha, \beta, the Lipschitz parameter, etc. are all known in advance. Are you able to show asymptotic guarantees in the case when these values are unknown?
- In line 430 it seems as though you dropped the factor from multiplying by K^2(r), which gives the \kappa term in front.
- Prior to stating Lemma A.2 it would be helpful to redefine the function \hat{f}_t.
- For Appendix Section C it might be helpful to summarize some of the results of the paper [3] that you refer to, for readers who are unfamiliar with that work.
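To make the kernel question above concrete, here is a small numerical sketch (my own illustrative code, not from the paper) of the standard weighted-Legendre-polynomial construction of a smoothing kernel: K(u) = sum_m p_m'(0) p_m(u) with p_m the Legendre polynomials orthonormalized on [-1, 1]. The function names and the choice of degree `ell` (corresponding, under my reading, to the integer part of \beta) are assumptions for illustration; the paper's \kappa_\beta constants would then be moments of |K| under this construction.

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

def legendre_kernel(ell, u):
    """K(u) = sum_{m <= ell} p_m'(0) p_m(u), where p_m is the Legendre
    polynomial normalized to be orthonormal on [-1, 1]. By the reproducing
    property of the Legendre basis, int u^j K(u) du equals 1 for j = 1 and
    0 for all other j <= ell."""
    K = np.zeros_like(u, dtype=float)
    for m in range(ell + 1):
        P = Legendre.basis(m)      # classical Legendre polynomial P_m
        w = (2 * m + 1) / 2.0      # 1 / ||P_m||^2 on [-1, 1]
        K += w * P.deriv()(0.0) * P(u)
    return K

def trap(y, u):
    """Composite trapezoid rule (written out to avoid NumPy version differences)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(u)) / 2.0)

u = np.linspace(-1.0, 1.0, 20001)
K = legendre_kernel(3, u)
moments = [trap(u**j * K, u) for j in range(4)]
# moments is approximately [0, 1, 0, 0]
```

For ell = 3 this gives K(u) = 9.375u - 13.125u^3, whose first moment integrates to exactly 1 while the 0th, 2nd, and 3rd moments vanish.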

### Review 2

Summary and Contributions: This paper studies the problem of zeroth order optimization of a strongly convex function, a problem already studied by Bach and Perchet. They analyze the algorithm of Bach and Perchet under additional assumptions, namely high-order smoothness (using a Hölder-like condition to quantify the smoothness) and derive bounds on the cumulative regret and optimization error in this setting. They also prove minimax lower bounds on this problem.

Strengths: The claims are all proved and the proofs seem technically correct. I appreciated seeing that all constants are explicit in the statements of the theorems, even if this does not improve the readability of the paper. I also appreciated the sketches of the proofs of the main theorems in the main text.

Weaknesses: I am unsure of the significance of this contribution. This paper builds heavily on the previous algorithm of Bach and Perchet. The contribution of this work is the analysis of that algorithm under the higher-order smoothness assumption, which does not seem especially interesting to me. On top of that, I am under the impression that the proofs are derived in the setting where all the parameters are known (the strong convexity constant, smoothness parameters, sigma, etc.), which does not seem realistic. I would have been interested in seeing numerical experiments. How can this algorithm be implemented in practice? And how do you ensure that its implementation achieves the same bounds without knowing all the parameters?

Correctness: Yes. To strengthen the claim that their analysis outperforms previous ones, it would have been interesting to add experiments.

Clarity: The paper is globally well written despite some typos (line 83, "estimated", for example). However, I did not feel at ease with the introduction. The authors do not really motivate their work and move very quickly to technicalities and explanations of the algorithm on the first page. I think the related work section could have been placed at the beginning of the paper.

Relation to Prior Work: Apart from its placement at the end of the paper, I found the related work section clear and well written. The authors clearly discuss their differences from previous contributions. Appendix C is also relevant in this direction.

Reproducibility: No

Additional Feedback: A direction for improvement could be the implementation of the algorithm in a practical setting, or a discussion of the fact that all parameters have to be known to derive the bounds of the theorems. How could you achieve these bounds without knowing the parameters?

### AFTER REBUTTAL ###

I have read the rebuttal and the authors' answers. They did not convince me of the significance of their work, which is nonetheless interesting from a mathematical point of view. I will therefore stick with my initial score.

### Review 3

Summary and Contributions: The authors study the zeroth-order convex optimization problem, analyzing a previous algorithm of Bach and Perchet under the additional assumptions that the target function is beta-Hölder smooth and alpha-strongly convex. The authors prove upper bounds on the regret when the only assumption on the noise is that its variance is bounded (though the algorithm needs to be tuned with knowledge of this variance). The authors also provide a guarantee for estimation of f(x^*) via a modified algorithm that uses a third query to estimate the function value. The authors provide stronger convergence guarantees in the beta = 2 case (exploiting the connection to the surrogate function), as well as lower bounds for arbitrary beta >= 2.

Strengths: Barring my being ignorant of a large part of this literature, the technical contributions seem strong. In particular, the algorithms analyzed are simple, intuitive, and computationally trivial. Pointers to improvements from the literature (such as the log(T) improvement that taking a tail average would yield) are appreciated. Presenting a lower bound really strengthens the upper bounds in the paper and provides a more holistic picture of the algorithm, and the paper seems to provide a satisfactory addition to our understanding of how smoothness affects rates. Section 7 contextualizes the results very well. Given the fundamental nature of zeroth-order approximation, I believe the results in this paper would be very relevant to the community.

Weaknesses: Sometimes the presentation is dense: a table, for example, would be a more efficient way to compare the derived rates with past results. There are a few points I wish the authors would discuss a bit more, including:
- the generalization to "adversarial noise"; e.g. explain why this generalization is plausible;
- showing the bias-variance decomposition explicitly, at least once, would be nice;
- can you explain why the kernel is redundant when beta = 2 (line 204)?
- since the claimed lower bound is novel, can you explain what is new about the construction?

Correctness: The proofs seem reasonable, but I didn't check them.

Clarity: Generally yes, but see some suggestions in 'Weaknesses' above.

Relation to Prior Work: Extensively.

Reproducibility: Yes

Additional Feedback: Why does Definition 2.1 need to constrain zeta and r? They are chosen by the learner.

# Post rebuttal: My questions were mostly addressed, though I still think Definition 2.1 is unnatural. I will keep my score.

### Review 4

Summary and Contributions: The authors focus on the problem of zero-order optimization of a strongly convex function, in order to find the minimizer of the function via a sequential approach. They focus on the impact of higher-order smoothness on both the optimization error as well as on the cumulative regret. They consider a randomized approximation of the projected gradient descent algorithm, where the gradient is estimated via two function evaluations and a smoothing kernel. Several theoretical results are derived under different settings. Their results imply that the zero-order algorithm is nearly optimal in terms of sample complexity and problem parameters. They also provide an estimator of the minimum function value which achieves sharp oracle behavior.
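To make the algorithm described above concrete, here is a minimal sketch (my own illustrative code, not the authors'): a two-point gradient estimate built from a smoothing kernel, plugged into projected gradient descent. The function names, the ball projection, the beta = 2 kernel K(r) = 3r (which folds in the 1/2 density of r ~ Uniform[-1, 1], so the estimate is unbiased for the gradient of a quadratic), and the step-size schedules are all placeholder choices, not the paper's exact tuning.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto a centered ball, standing in for the
    projection onto the constraint set Theta."""
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def grad_estimate(f, x, h, rng, kernel=lambda r: 3.0 * r):
    """Two-point zeroth-order gradient estimate with a smoothing kernel:
    perturb x by h*r*zeta along a random direction zeta on the unit
    sphere, query f twice, and rescale the difference."""
    d = x.shape[0]
    zeta = rng.standard_normal(d)
    zeta /= np.linalg.norm(zeta)          # uniform direction on the sphere
    r = rng.uniform(-1.0, 1.0)
    y_plus = f(x + h * r * zeta)          # two (possibly noisy) queries
    y_minus = f(x - h * r * zeta)
    return (d / (2.0 * h)) * (y_plus - y_minus) * kernel(r) * zeta

def zo_pgd(f, x0, T, eta, h, rng):
    """Projected gradient descent driven by the estimates above.
    eta, h: callables giving the step size and perturbation at round t."""
    x = x0.copy()
    for t in range(1, T + 1):
        g = grad_estimate(f, x, h(t), rng)
        x = project_ball(x - eta(t) * g)
    return x
```

On a toy strongly convex quadratic with conservative 1/t step sizes, the iterates drift toward the minimizer, illustrating the sequential scheme the review describes.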

Strengths: EDIT: I thank the authors for their response, which has addressed my questions. I leave my evaluation unchanged.

----------------------------------------------------------------------------------------------

The paper is very well written and is a pleasure to read. The main strengths are as follows:
1. A very clear problem statement and clean highlights of the contributions.
2. The theoretical proofs seem sound. (Unfortunately, I was not able to go through the detailed proofs, but the proof sketch seems correct and clearly highlights the steps needed to prove the claims.)
3. The paper is highly novel, as it improves on previous literature and considers settings that are much more general than before. Their result on the lower bound is very novel.
4. This is of high relevance to the NeurIPS community.

Weaknesses: I didn't find any strong weaknesses in the paper, but it might be good to add some empirical results on some functions, to visually see how these bounds behave in practice.

Correctness: The sketch of the proofs seems correct. The paper did not include any empirical evidence.

Clarity: It is very well written and easy to follow. A few minor suggestions:
1. For the claim in lines 180 and 181, can the authors add a reference?
2. Typo in line 206: x -> x_t in defining y_t and y_t'.

Relation to Prior Work: The related works section is excellent and clearly highlights how this work differs from previous contributions.

Reproducibility: Yes