Review for NeurIPS paper: BoTorch: A Framework for Efficient Monte-Carlo Bayesian Optimization

NeurIPS 2020

BoTorch: A Framework for Efficient Monte-Carlo Bayesian Optimization

Review 1

Summary and Contributions: EDIT: I've read the rebuttal.

This work presents a Bayesian optimization library, BoTorch, which has the following advantages over existing BO libraries:
* A novel approach to optimizing MC acquisition functions using fixed sample averages. This leads to much faster and easier implementation of second-order methods and look-ahead BO methods (e.g., knowledge gradient).
* Novel general convergence results for the sample-average approximation of acquisition functions via randomized quasi-MC.

Strengths:
* This BO library is largely distinct from existing ones due to its approach to MC acquisition functions: a fixed sample average is used and optimized throughout the whole process. This allows approximate second-order methods such as L-BFGS to be applied stably, which is not the case if samples are repeatedly redrawn to evaluate the expectation, as is done in stochastic gradient methods.
* A wide range of experiments is conducted to demonstrate the performance improvement, and the results are convincing compared to multiple baselines (implemented in the same and other libraries).
* Using this approach, BoTorch provides a fast implementation of the knowledge gradient method: one-shot KG, which gets rid of inner optimization loops. This seems to be a significant contribution to the KG algorithm and has potential applications in other look-ahead BO methods.
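The first point can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration: the toy `posterior`, the improvement-style integrand, and the incumbent value `best` are hypothetical, and a golden-section search stands in for L-BFGS so that no external optimizer is needed. It is not BoTorch's implementation; it only shows why fixing the base samples turns the MC estimate into a deterministic function that a deterministic optimizer can handle.

```python
import math
import random


def posterior(x):
    """Hypothetical 1-D posterior mean and standard deviation at x."""
    mean = math.sin(3.0 * x)
    std = 0.2 + 0.3 * abs(x - 0.5)
    return mean, std


def mc_improvement(x, base_samples, best=0.5):
    """MC estimate of expected improvement using y = mean + std * z."""
    mean, std = posterior(x)
    total = sum(max(mean + std * z - best, 0.0) for z in base_samples)
    return total / len(base_samples)


def golden_section_max(f, lo, hi, tol=1e-8):
    """Deterministic 1-D maximizer; it assumes f returns the same value
    every time it is called with the same x."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d
        else:
            a = c
    return (a + b) / 2.0


rng = random.Random(0)
fixed_z = [rng.gauss(0.0, 1.0) for _ in range(256)]  # drawn once, then reused


def saa(x):
    """Sample-average approximation: same base samples on every call."""
    return mc_improvement(x, fixed_z)


def resampled(x):
    """Stochastic estimate: fresh base samples on every call."""
    fresh = [rng.gauss(0.0, 1.0) for _ in range(256)]
    return mc_improvement(x, fresh)


x_star = golden_section_max(saa, 0.0, 1.0)
```

Calling `saa(0.3)` twice returns the same number, whereas two calls to `resampled(0.3)` differ, which is what makes comparisons between function values (and hence line searches and curvature estimates) unreliable in the resampled case.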

Weaknesses:
* As a software paper, probably due to space limits, Section 5 is too brief for readers to understand the main concepts in the library, and most features described there aren't illustrated with code. As a result, it is very difficult to understand the code segments provided in the text.
* L179: "BOTORCH provides a @concatenate_pending_points decorator to add this functionality to any MC acquisition function." I don't think the mechanism of parallel and asynchronous BO in BoTorch is described anywhere in the paper.
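To make the second point concrete, the following is a guess at the kind of mechanism such a decorator could implement, written as a plain-Python sketch. The names `with_pending_points`, `ToyBatchAcquisition`, and the `X_pending` attribute as used here are hypothetical and chosen for illustration; this is not taken from the paper or from the library's source. The assumed idea is that points already submitted for evaluation but not yet observed are appended to the candidate batch, so the acquisition value is computed jointly over both.

```python
import functools


def with_pending_points(method):
    """Hypothetical decorator: append pending points to the candidate batch."""

    @functools.wraps(method)
    def wrapper(self, X):
        pending = getattr(self, "X_pending", None)
        if pending:
            # Evaluate candidates jointly with points still being evaluated
            X = list(X) + list(pending)
        return method(self, X)

    return wrapper


class ToyBatchAcquisition:
    """Toy stand-in for a batch (q-point) MC acquisition function."""

    def __init__(self, X_pending=None):
        self.X_pending = X_pending

    @with_pending_points
    def forward(self, X):
        # Placeholder joint score; only the batch handling is illustrated
        return max(-(x - 0.3) ** 2 for x in X), len(X)


acq = ToyBatchAcquisition(X_pending=[0.9])
value, q_joint = acq.forward([0.1, 0.25])
```

Here `q_joint` is 3: the two candidates plus the one pending point. A short description along these lines in the paper would help readers understand how asynchronous evaluation is supported.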

Correctness: Overall, the paper is technically sound, although I did not check the proof of the convergence results (the proof is directly adapted from Homem-de-Mello (2008)).

Clarity: The methodology is clear. The clarity can be improved if the library feature description is paired with code examples.

Relation to Prior Work: The comparison to GPFlowOpt could be carried out more carefully.
* L50: "GPFlowOpt inherits support for auto-differentiation and hardware acceleration from TensorFlow [via GPFlow, 60], but unlike BOTORCH, it does not use algorithms designed to specifically exploit this potential." Can you be more specific about the potential you are referring to here?

Reproducibility: Yes

Additional Feedback: * It is not required but, since this is a known open-source library from a big company, should the broader impact section focus more on how it will be deployed in products and the potential negative impacts?

Review 2

Summary and Contributions: The authors introduce BoTorch, a framework for BO. BoTorch is a collection of useful techniques that make BO efficient, including MC acquisition functions, SAA optimization, auto-differentiation, etc. BoTorch facilitates the specification of new acquisition functions. Theoretical convergence results are provided for the SAA BO approach and for the One-shot KG formulation.

Strengths: This paper is highly relevant for the NeurIPS community and tackles the important problem of Bayesian Optimization (BO). The claims are sound, significant, and novel. From a technical perspective, the authors put a lot of effort into implementing a well-thought-out framework for BO that is released as open source. This is by itself a great technical contribution that will greatly benefit the BO community. From the theoretical perspective, the idea of using SAA and the theoretical guarantees are of significant value and, given the evidence provided in the paper, compare well against stochastic approaches. The real added value of SAA is that it transforms the optimization into a deterministic setting where standard optimization techniques can be used. The one-shot formulation of KG and its theoretical result are also relevant to the BO community given the importance of the KG method. The novel techniques have been used in a series of satisfying experiments.
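For readers less familiar with the construction, the step from the SAA form of KG to the one-shot form can be sketched as below. The notation is assumed here for illustration and may not match the paper's exactly: N fantasy samples, D_x^i for the data set augmented with the i-th fantasized observation at x, and x'_i for the corresponding inner solution. Because the inner maximizations are separable across i, moving them into the outer problem leaves the optimal value unchanged.

```latex
% SAA of KG: one inner maximization per fantasy sample
\max_{x} \; \frac{1}{N} \sum_{i=1}^{N} \max_{x'_i} \,
  \mathbb{E}\!\left[ f(x'_i) \mid \mathcal{D}_{x}^{i} \right]

% One-shot form: the inner maximizers become joint decision variables
\max_{x,\, x'_1, \dots, x'_N} \; \frac{1}{N} \sum_{i=1}^{N}
  \mathbb{E}\!\left[ f(x'_i) \mid \mathcal{D}_{x}^{i} \right]
```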

Weaknesses: Perhaps my main remark is that the paper could have been structured differently. The format of the paper is unconventional, i.e., the theoretical results and the experiments are mixed. I'd suggest using a more conventional structure to benefit the reader. The background section could be extended to make the reader more comfortable with some of the concepts that are heavily used in the rest of the text, e.g., fantasies and KG. While some of these are briefly introduced later in the text, this makes it difficult to separate the contributions from the background and again gives the paper an unstructured feeling.

Minor remarks:
* Line 69: the authors say that f_D(x) and y_D(x) are multivariate normal. Since they are uncountably infinite, they are not proper multivariate normal distributions; they are Gaussian processes instead. The same non-rigorous description of the posterior appears at line 160.
* Line 160: "be be"
* Figure 6: It is my understanding that this batch parallelism is used to optimize the acquisition function. To properly evaluate the wall-clock time of the parallelized acquisition function evaluation, the GP queries also need to be factored in. This information is missing from the chart. Are you excluding the GP queries from the y axis?
* Why is the Hartmann6 function used for the feasibility-constraint experiments rather than more common test functions that come equipped with feasibility constraints? Why is the comparison only against random sampling?
* Line 338: "we give an specific example"

Correctness: The claim, methods and empirical methodology are correct.

Clarity: See above. I believe that the paper would benefit from some restructuring of its sections, following a more conventional NeurIPS writing style.

Relation to Prior Work: This is clear enough, although moving some of the introductory material into the background section would make it easier for readers at varying levels to distinguish the contributions from the background.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: This article presents results on the use of Sample Average Approximation for Bayesian optimization acquisition functions in Monte Carlo form. In addition, a new way to compute the knowledge gradient acquisition function is proposed. Apart from these methodological and theoretical results, the paper details the capabilities of the BoTorch implementation.

Strengths: The paper provides an overview of available BO software, while focusing mostly on BoTorch.

Weaknesses: Focusing so heavily on the software part hides the methodological contributions (which are orthogonal), and is possibly uncommon. The assessments could be more balanced, and the evaluation of the new results more exhaustive.

Correctness: The list of possible competitors is not complete (e.g., GPyFlowOpt is missing). It would be fairer to conduct an open competition; in fact, there is a BO competition in the NeurIPS 2020 Competition Track.

Clarity: The paper is easy to follow, but is perhaps too focused on the software part.

Relation to Prior Work: The literature review is extensive.

Reproducibility: Yes

Additional Feedback: The benchmark results are relatively limited. More extensive tests could be performed, e.g., with the COCO platform (Hansen, N., Auger, A., Mersmann, O., Tusar, T., & Brockhoff, D. (2016). COCO: A platform for comparing continuous optimizers in a black-box setting. arXiv preprint arXiv:1603.08785). Another potential metric of interest is convergence as a function of wall time, not just the number of iterations. Efforts could be made to install GPflowOpt, possibly by contacting its authors. Similarly, many R packages provide BO capabilities. The drawbacks of the new KG formulation should be discussed. Overall, a more balanced discussion of the methodology would be beneficial and would make the paper feel less promotional.

*** Post-rebuttal comment ***
I thank the authors for their response to my comments. As mentioned in the reviews, the mix of software, methodological, and theoretical contributions in only 8 pages does not work very well. The additional 9th page may help a bit, but not completely. For this reason, the paper is, in my opinion, more suited to a software journal. The software capabilities are appealing, but the widespread use of this software will depend on many factors. I increased my score accordingly.

Review 4

Summary and Contributions: This work suggests a sample average approach to the MC estimation of acquisition functions in Bayesian optimisation. This moves the MC approximation of the acquisition function outside of the BO procedure, so that deterministic optimisers can be applied in the acquisition step. The authors provide theoretical convergence guarantees for this approach, together with a complete software package to perform BO.

Strengths: The paper introduces a sample-average approach to the MC approximation of intractable acquisition functions. This allows higher-order optimisers to be used in BO instead of the first-order stochastic-gradient-type approaches more traditionally used, and allows the use of RQMC methods for variance reduction. These methods lead to faster convergence, which is further supported by empirical results. The proposed SAA method is also backed by theoretical results establishing the convergence of the approximated acquisition function to the true function, and the convergence of the maximiser to the true value, along with an exponential convergence rate. The benefit of the SAA approach is well demonstrated in Section 6, where it is used to evaluate particularly challenging look-ahead acquisition functions such as the KG acquisition function. The method is accompanied by a well-designed modular software package that has the potential to facilitate future research into Bayesian optimisation and acquisition functions, which could prove useful to the BO community. Comparisons with several existing software packages in the same area demonstrate that the introduced method provides performance gains over these implementations.
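The variance-reduction effect of RQMC mentioned above can be illustrated with a small standard-library sketch. It is an illustration under stated assumptions, not the paper's setup: the integrand is a toy improvement-style function of a standard normal variable, and a randomly shifted van der Corput sequence stands in for whichever RQMC construction the paper actually uses.

```python
import random
import statistics

NORMAL = statistics.NormalDist()


def van_der_corput(i, base=2):
    """i-th point of the van der Corput low-discrepancy sequence."""
    value, denom = 0.0, 1.0
    while i > 0:
        i, digit = divmod(i, base)
        denom *= base
        value += digit / denom
    return value


def to_normal(u):
    """Map u in [0, 1) to a standard normal value via the inverse CDF."""
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep away from the endpoints
    return NORMAL.inv_cdf(u)


def integrand(z):
    """Toy improvement-style integrand max(z - 0.5, 0)."""
    return max(z - 0.5, 0.0)


def mc_estimate(rng, n):
    """Plain MC: n independent normal draws."""
    return sum(integrand(rng.gauss(0.0, 1.0)) for _ in range(n)) / n


def rqmc_estimate(rng, n):
    """RQMC: low-discrepancy points with one random shift per replicate."""
    shift = rng.random()
    points = ((van_der_corput(i) + shift) % 1.0 for i in range(n))
    return sum(integrand(to_normal(u)) for u in points) / n


rng = random.Random(1)
n, reps = 64, 200
mc_var = statistics.variance([mc_estimate(rng, n) for _ in range(reps)])
rqmc_var = statistics.variance([rqmc_estimate(rng, n) for _ in range(reps)])
```

With the same number of points per estimate, the spread of the RQMC estimates across replicates (`rqmc_var`) comes out smaller than that of the plain MC estimates (`mc_var`), which is the property that makes the sample-average objective a better approximation at a given sample budget.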

Weaknesses: This work provides two contributions: the advocacy of the SAA of the acquisition function, and a software package that implements this approach as well as other BO methods. While both aspects have their positives, as discussed above, it is not entirely clear that either aspect on its own is strong enough for acceptance. The authors mention many existing software packages that can implement the same models, and while BoTorch seems to be well-written software with an appealing modular design and extensibility, it is not clear that it offers something fundamentally new to the ML community, even if it may prove useful. While the authors do argue that existing implementations lack their modularity or hardware acceleration, the relatively simple structure of the BO algorithm means that it can be implemented with relative ease using a functional approach inside a hardware-accelerated framework. Similar remarks hold when attempting to independently assess the contribution of the SAA approach: while there is some evidence of the benefits of this approach in Fig. 3, this figure is barely discussed in the main body of the paper. Again, I would stress my generally positive opinion of the paper, but the presentation and frequent references to the appendix make it hard to distinguish the standout contributions.

Correctness: The theoretical and methodological claims seem correct.

Clarity: While individual sections of the paper are in general well written, the paper as a whole does not seem to have been well condensed to the page limit, as references to material in the appendices are made far too frequently. This significantly hurts the readability and flow of the paper and, importantly, makes it harder to identify the standout contributions that the authors considered most important to include in the main body. An additional minor comment: it is perhaps confusing to have $\alpha$ be both the general notation for the acquisition function and the exponential rate constant in Theorem 1.

Relation to Prior Work: Yes. While there are numerous pre-existing software packages for BO, the paper discusses these in good detail, highlighting the main differences.

Reproducibility: Yes

Additional Feedback: