NeurIPS 2020

Diversity-Guided Multi-Objective Bayesian Optimization With Batch Evaluations


Review 1

Summary and Contributions: The paper proposes a batched Bayesian multi-objective optimization method that uses diversity in the design and performance space during sampling to balance exploration and exploitation. The key contribution is the batch selection strategy, which is guided by a diversity measure and hypervolume improvement metrics. The diversity measure is obtained by grouping candidate points into diversity regions via a performance-buffer data structure, in a manner similar to graph cuts. The actual selection is then a greedy approach that tries to maximize both diversity and hypervolume improvement. The paper presents experimental results that show superior performance both quantitatively (hypervolume indicator) and qualitatively on several synthetic and real-world benchmarks.
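To make the selection rule above concrete, the following is a minimal sketch of a greedy, diversity-aware batch selection. The function and variable names are illustrative, and the per-candidate improvement scores are treated as fixed for simplicity; the paper's actual procedure would, among other things, recompute hypervolume improvement as the batch grows.

```python
import numpy as np

def greedy_diverse_batch(candidates, region_ids, hv_improvement, batch_size):
    """Greedy batch selection sketch: maximize (approximate) hypervolume
    improvement while spreading picks across diversity regions.

    candidates     : (N, d) array of candidate design points
    region_ids     : length-N array mapping each candidate to a diversity region
    hv_improvement : length-N array of per-candidate improvement scores
    batch_size     : number of points to select
    """
    selected, used_regions = [], set()
    remaining = set(range(len(candidates)))
    while len(selected) < batch_size and remaining:
        # Prefer candidates from regions not yet represented in the batch.
        pool = [i for i in remaining if region_ids[i] not in used_regions]
        if not pool:                 # every region already used: allow repeats
            used_regions.clear()
            pool = list(remaining)
        best = max(pool, key=lambda i: hv_improvement[i])
        selected.append(best)
        used_regions.add(region_ids[best])
        remaining.remove(best)
    return candidates[selected]
```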

Strengths: The proposed algorithm seems sound; diversity in sampling is known to help avoid local-minima problems, and this paper leverages it in the MOBO domain quite efficiently. I think the paper makes a valuable contribution to the batch-MOBO domain. The authors propose several simplifications that keep the runtime of the method in the same league as related methods (only a few iterations of Pareto front approximation, greedy selection). Besides the superior performance, several ablation studies over the batch size show robust performance and the algorithm's flexibility.

Weaknesses: The proposed method builds upon a lot of previously well-studied elements, and I find only the diversity technique in the sampling process to be somewhat novel. Still, it introduces quite a few extra calculations; in fact, only ParEGO and MOEA are slower. These algorithms, however, are not primed for batches, so the comparison is questionable. I am also wondering how the grouping method's buffer capacity impacts the quality of the results.

Correctness: Yes, the paper clearly describes the method via algorithm sections that are easy to understand.

Clarity: The paper is well-written, easy to follow. I found the math solid and Figure 1 particularly useful in understanding how the proposed method differs from related ones.

Relation to Prior Work: The paper does a good job going over related work in the MOBO domain and makes notable efforts to distinguish its contributions.

Reproducibility: Yes

Additional Feedback:


Review 2

Summary and Contributions: This paper considers the problem of black-box optimization of multiple objectives using a batch of inputs for function evaluation in each iteration of Bayesian optimization (BO). The paper provides an approach for selecting a batch of inputs by balancing diversity and potential improvement in the Pareto set of solutions. There are three key steps in the algorithm: 1) solve a cheap multi-objective optimization problem over the mean of the GP posterior for each black-box function using a recent local-search-style approach; 2) identify linear subspaces as diverse regions by taking into account accuracy and neighborhood information in the input space (this step is tied to the local search method in step #1); 3) use a greedy algorithm to maximize the hypervolume improvement by selecting inputs from the diverse regions uniformly. Experiments are performed on both synthetic and real-world benchmarks with good results.
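As a crude stand-in for step 1, the sketch below approximates the Pareto front of the GP posterior means by random sampling followed by non-dominated filtering; the paper instead uses a first-order local-search-style approach, so this is only an illustration under that simplification, and the model interface (`.predict`) is an assumption.

```python
import numpy as np

def cheap_pareto_front(gps, bounds, n_samples=2000, seed=0):
    """Approximate the Pareto front of the GP posterior means (minimization).

    gps    : list of fitted surrogates exposing .predict(X) -> posterior mean
             (e.g. sklearn GaussianProcessRegressor instances)
    bounds : (lower, upper) arrays defining a box-constrained design space
    """
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(bounds[0]), np.asarray(bounds[1])
    X = rng.uniform(lower, upper, size=(n_samples, len(lower)))
    F = np.column_stack([gp.predict(X) for gp in gps])  # one column per objective

    # Non-dominated filter: drop x_i if some other point is <= in every
    # objective and strictly < in at least one.
    keep = np.ones(len(F), dtype=bool)
    for i in range(len(F)):
        dominated = np.all(F <= F[i], axis=1) & np.any(F < F[i], axis=1)
        if dominated.any():
            keep[i] = False
    return X[keep], F[keep]
```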

Strengths: + Considers a relatively less-studied problem. + Extremely well-written paper. + Good experimental results.

Weaknesses: - Technical novelty is low. - Experimental evaluation is lacking to some extent.

Correctness: Claims and method are correct. Empirical methodology is correct, but evaluation can be improved.

Clarity: Extremely well-written paper except for a small part.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Comments:
1. This is a very well-written and well-executed paper on a relatively less-studied problem.
2. Technical novelty is on the lower side. Step #2 is tied to the local search method in Step #1 and is the core technical innovation.
3. There are many existing methods from batch BO in the single-objective setting which can easily be used as wrappers on top of any given multi-objective acquisition function. Some examples: local penalization (https://arxiv.org/abs/1505.08052), Determinantal Point Processes (DPPs), and the Stein method (http://proceedings.mlr.press/v97/gong19b/gong19b.pdf). I would have liked to see some experimental results for some instantiation of the above methods (e.g., local penalization is simple to implement) for multi-objective BO; a rough sketch of this idea is given below.
4. The exposition for Step #2 can be improved to make the paper self-contained. In its current form, the text is too high-level and not all the details are clear.

Post rebuttal: Thanks for the detailed response. I strongly encourage the authors to include a baseline from my comment #3. As I mentioned, the local penalization method is simple to implement and test.
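To make comment #3 concrete, here is a rough, simplified sketch of a local-penalization wrapper in the spirit of the linked paper, applied to precomputed per-candidate acquisition values (which could come from any scalarized multi-objective acquisition). The Lipschitz estimate `L`, the incumbent `best_f` (maximization convention), and the finite candidate-set formulation are assumptions made for illustration, not the cited method's exact implementation.

```python
import numpy as np
from scipy.stats import norm

def local_penalizer(x, x_batch, mu_batch, sigma_batch, L, best_f):
    """Soft penalizer around already-selected points (maximization convention):
    the closer x is to a pending point, the smaller the multiplier."""
    r = np.linalg.norm(x - x_batch, axis=1)
    z = (L * r - best_f + mu_batch) / np.maximum(sigma_batch, 1e-12)
    return np.prod(norm.cdf(z))

def penalized_batch(acq, candidates, mu, sigma, L, best_f, batch_size):
    """Greedy batch selection: repeatedly pick the candidate that maximizes
    acquisition value times the product of penalizers around chosen points.

    acq, mu, sigma : length-N arrays (acquisition value, GP mean, GP std)
    candidates     : (N, d) array of candidate design points
    """
    batch = []
    for _ in range(batch_size):
        scores = np.array([
            acq[i] * (local_penalizer(candidates[i], candidates[batch],
                                      mu[batch], sigma[batch], L, best_f)
                      if batch else 1.0)
            for i in range(len(candidates))
        ])
        scores[batch] = -np.inf  # never re-select a point already in the batch
        batch.append(int(np.argmax(scores)))
    return candidates[batch]
```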


Review 3

Summary and Contributions: The article describes a new algorithm for conducting multiobjective optimization using Gaussian process models to improve sample efficiency. Their main contribution involves recognizing that Pareto frontiers can likely be described as having multiple components, each of which is continuous but which correspond to disjoint regions in parameter space. The authors propose a strategy to exploit this knowledge, implement that strategy (the implementation is provided in the supplementary material), and test it on a selection of sample problems. Performance is defined in terms of the hypervolume of the associated approximate Pareto frontier, and their algorithm has superior performance on some of the problems.

Strengths: This article takes on a very timely topic (multi-objective optimization of expensive objectives) and correctly identifies an element which has been missing from existing strategies (the fact that Pareto frontiers are often comprised of multiple disjoint continuous sections). The authors effectively set the scene for the reader and concisely present their approach to exploiting this Pareto frontier knowledge. The authors have tested their strategy on a spread of problems, though their actual breadth is a bit limited (all these problems are fully continuous and, I think, <=6 dimensions). The article is also well written.

Weaknesses: The description of the diversity regions feels extremely important to the contributions of this paper. The section explaining how the diversity regions are defined and implemented is only a single paragraph, which provides no insight whatsoever into the actual process that is occurring. It also provides no insight into the implications of the various decisions behind this diversity-region definition. Personally, I would see this as a fundamental and necessary element of the analysis and experiments which are presented, given that the key novelty of your method is, as you have stated, this desire for diversity. I would recommend the following:
1) Skip some of the intro content or push it to an appendix. Use the additional space to build out your explanation of the "Split points from ..." step in your algorithm, which is, I think, the key novel part of this article.
2) Provide a more complete explanation of how your diversity-region algorithm works, what free parameters are present in it (and how they impact the result), how it functions under different circumstances (number of diversity regions, distance between them, size of diversity regions, extreme circumstances such as only a single point on the Pareto frontier), and how many points are needed to find diversity regions with any accuracy.
3) Explain how noise/randomness in function evaluations impacts your results. Realistically, any problem of interest (certainly in the NeurIPS community) will likely have randomness/uncertainty in it. How does your algorithm deal with noisy objectives? How, specifically, does your diversity mechanism deal with noisy objectives (since noise complicates the ability to accurately define points as being on the Pareto frontier)? Similarly, any algorithm that requires solving the dual from the KKT conditions feels very subject to uncertainty (since the quality of your approximation to the Pareto frontier is complicated significantly by noise).
4) Explain how your ability to build these diversity regions plays a role in your algorithm's performance. Your Table 1 does not seem to provide any relevant information (though you can keep it if you would like). Your Figure 2 provides 20 experimental results, but no ability to understand what aspect of these problems made your algorithm more successful (in the cases where it was more successful). Can you explain or depict what part of the frontier your algorithm was able to find that the other methods were not? Can you identify any common structures/behaviors that yielded wins for your method? It seems like your hypervolume was much better for ZDT3, DTLZ4, VLMOP2, OKA1, RE2, RE6: do these have certain things in common? Or is this a situation where the baselines against which you are comparing are just very weak? Personally, I would only really have expected the TSEMO baseline to be of comparable quality.

Another point I would like to make is that the numerical experiments are not terribly convincing. First off, there is no way to interpret increased hypervolume from a practical perspective (a short sketch of how this indicator is computed is included after this section, for reference). I do not dispute that a greater hypervolume is good, but I do not think that it provides a comprehensive sense of the performance of a multiobjective optimization tool. Maybe providing the frontiers and actually pointing to the improvements which are observed would be beneficial. At present I cannot interpret these numbers or how relevant they are.

Another point of interest for me is the batch situation. Obviously, the ability to execute in parallel is a relevant part of any practical sample-efficient optimization strategy, but I feel like asynchronous parallelism is more the standard among open-source tools than batch parallelism. Ignoring that, I am also surprised that there are no sequential results. In principle, is your algorithm supposed to perform better or worse in a sequential setting, compared to the performance gap we see in Figure 2? Along those lines, I do not understand why you are comparing against methods which do not have a parallel implementation, such as ParEGO (and, I assume, others). Why not simply run a sequential comparison, if what you are trying to argue is that your diversity methodology is a fundamentally beneficial tool? If it is only beneficial in a batch-parallelism setting, I think you should state that explicitly to readers (and not handicap other baselines by running them in unfavorable circumstances).

One fundamental weakness of this algorithm is that it relies on a continuous approximation of the Pareto frontier. This seems impossible in circumstances with discrete parameters (which includes most problems of relevance) and also seems problematic in noisy circumstances (where these continuous predictions are then also subject to noise). This alone is not a barrier to publication, but it would be an impassable barrier to usage in any actual circumstance (if the continuous approximation were a fundamental driver for quality).
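For reference, the hypervolume indicator discussed above can be computed exactly in the two-objective (minimization) case with a simple sweep. The sketch below is only meant to make the quantity concrete; the reference point and example values are illustrative, not the paper's evaluation setup.

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Area dominated by a set of 2-objective (minimization) points, bounded
    by the reference point `ref`: sort by the first objective and sum rectangles."""
    pts = np.asarray(points, dtype=float)
    pts = pts[np.all(pts <= ref, axis=1)]   # ignore points beyond the reference
    if len(pts) == 0:
        return 0.0
    pts = pts[np.argsort(pts[:, 0])]        # sweep in order of increasing f1
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 < prev_f2:                    # skip points dominated within the set
            hv += (ref[0] - f1) * (prev_f2 - f2)
            prev_f2 = f2
    return hv

# Two points partially covering the unit square below the reference (1, 1):
print(hypervolume_2d([[0.2, 0.8], [0.6, 0.3]], ref=(1.0, 1.0)))  # ~0.36
```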

Correctness: How was the choice of DGEMO hyperparameters made? How sensitive is the performance of your algorithm to those choices? I realize, of course, that some choice needs to be made; I just want to understand how this particular decision was made. Was there a great deal of experimentation, or were they just chosen randomly? Was any experimentation done using the functions that you show us in Table 2? If so, might that be biasing the performance of your algorithm on those problems? Was any experimentation done on the baseline hyperparameters to help improve performance? If not, how much of the performance gap you are seeing do you think could be accounted for if the same care were applied to the baselines as was applied to your algorithm? I would love to have also seen a uniform random baseline, just to set a lower bound on how hard/easy these problems are (and the gap between the other baselines and the "nothing" baseline). What is the implication of the problem dimension? What about the number of metrics? How do these impact the performance of your method (relative to the alternatives)?

Clarity: The publication is clearly written in good English. A solid introduction and background discussion is provided to help set the scene for the reader.

Relation to Prior Work: This article has presented some discussion of other work. I would not say that all the relevant work has been mentioned (though I am not someone who cares that much). For example, http://proceedings.mlr.press/v48/hernandez-lobatoa16.html on the pure BO side. I also think that most of the good work on this topic has probably been done more on the application side. Some quick examples that I was able to find with only minor searching include: https://dl.acm.org/doi/abs/10.1145/3195970.3196078, https://www.osapublishing.org/optica/abstract.cfm?uri=optica-7-7-784, https://www.sciencedirect.com/science/article/abs/pii/S0925231219308525. You should not feel compelled to cite any/all of these, nor test against them as baselines (I only remember the Hernandez-Lobato paper because I happened to be there when it was presented to a skeptical audience). And I would never condemn anyone to having to run entropy search code. I bring this up mainly to state that the application publications/community may have much more to say on this topic than the pure BO community.

Reproducibility: Yes

Additional Feedback: Page 6: spelling -- "enfocing".

I think that this topic is good and worth talking about, and I feel that the authors have a good idea and the wherewithal to study it, but I feel like the analysis is incomplete. I think that if this algorithm can only work effectively in certain situations (such as only continuous parameters with >=4 batch parallelism and no noise) and will show no improvement in other situations, these things need to be explicitly stated. If this works fine in a broad set of circumstances, then I would like to hear about that also, but the experiments that are presented seem to have been chosen because they were readily available, not because they represent some specific topic or set of problems with specific characteristics which this algorithm is meant to address effectively. Without that structure, it's impossible to understand whether any of the progress claimed by this paper is actually a function of the decisions the authors have made to address those circumstances, or simply a byproduct of weak/untuned baselines.


Review 4

Summary and Contributions: This paper proposes a novel batch Bayesian optimization method for multi-objective optimization. The batch selection method considers diversity in both the design space and the performance space. The design space is first partitioned into multiple regions; then a greedy procedure picks points one by one, excluding previously chosen regions, by maximizing hypervolume improvement. Extensive experiments are conducted on both synthetic and real problems, and superior results are observed compared to several baselines.

Strengths: Extensive experiments and superior empirical results. Code is provided; I'm convinced the experiments are reproducible.

Weaknesses: The technical content seems to be mostly based on [39].

Correctness: Appears to be correct.

Clarity: This paper is well written overall, but the main technical part can be hard to follow for readers not familiar with the previous paper [39].

Relation to Prior Work: There are two main parts to the algorithm: 1) Pareto front approximation, which seems to be adopted from [39]; 2) batch selection, which also uses the idea of a performance buffer, likewise introduced in [39]. It is not clear to me exactly which part is novel.

Reproducibility: Yes

Additional Feedback: The technical novelty is somewhat limited in my opinion, since most of the material is based on [39], but the paper appears to have solid empirical results.

Update after rebuttal: The author feedback partially addressed my concern about novelty. I agree to accept the paper, but my score remains unchanged.