NIPS 2017
Mon Dec 4th through Sat the 9th, 2017 at Long Beach Convention Center
Paper ID: 2724 Bayesian Optimization with Gradients

### Reviewer 1

The paper proposes a novel acquisition function, d-KG (for derivative-enabled knowledge gradient), which is an extension of the knowledge gradient (KG) of Wu and Frazier 2016. The main difference, to my understanding, is that the possible future gradients (batch as well as online evaluations) also affect the acquisition function, and not just the marginal posterior over the objective, which, the authors argue, should improve the selection of evaluation locations. The authors also provide a way of estimating the gradients of d-KG in an unbiased way, such that stochastic optimizers can be used to maximize it. The method is evaluated on classic synthetic BO benchmarks and on real-world applications, e.g. a logistic regressor and an MLP.

The paper is well written and clear. The novel contributions are clearly marked as such, and the connections and distinctions to existing methods that the authors point out are very informative. I did not entirely follow the proof of Section 3.4, nor the derivation of \nabla d-KG, since they were not contained in the main paper, but the line of argument in Section 3.3 sounds plausible to me.

Some questions:
- Why is the noise on the derivative observations independent? Likewise, why the independence between gradient and value observations? Is this purely for computational convenience/easier estimation of \sigma, or is there an underlying reason?
- I am not quite sure how \theta is defined. Is \theta \nabla y (below Eq. 3.5) a projection? If yes, a transpose is missing. Also, must \theta have unit norm?
- In Section 4.1, why did you observe the fourth derivative? From the pseudocode in Algorithm 1 it looks as if \theta is optimized over.

A major practical concern:
- You use SGD to optimize d-KG. How do you tune its learning rate? This might be quite tricky in practice, especially since you might need to re-tune it after each update of the GP and not just once for the whole BO procedure.

Nitpicky comments:
- line 168: should it be `a continuous' instead of `an continuous'?
- y-labels on Figure 2 would be nice.
- line 242: `We also...': word missing?
- line 288: `are can': one word too much?
- Please don't call a shallow MLP with 2 hidden layers a `deep neural network' (lines 283, 292, and the conclusion).

(*Edit:* I increased the score since I found the authors' arguments convincing, but I encourage them to comment on the learning-rate tuning as promised in the rebuttal.)
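
To make the first question above concrete: with a joint GP over values and gradients, independent observation noise amounts to adding separate diagonal variances to the joint covariance of [f; \nabla f], which is presumably the computational convenience in question. Below is a minimal NumPy sketch for an RBF kernel, not the authors' code; the noise variances `noise_f` and `noise_grad` are hypothetical placeholders.

```python
import numpy as np

def rbf_joint_cov(X, lengthscale=1.0, signal_var=1.0,
                  noise_f=1e-4, noise_grad=1e-4):
    """Joint covariance of the stacked observations [f(X); grad f(X)]
    under an RBF kernel, with independent Gaussian noise added to the
    value and gradient observations (illustrative sketch only)."""
    n, d = X.shape
    diff = X[:, None, :] - X[None, :, :]          # diff[i, j, a] = (x_i - x_j)_a
    sqdist = np.sum(diff ** 2, axis=-1)
    K = signal_var * np.exp(-0.5 * sqdist / lengthscale ** 2)

    # cov(f(x_i), df(x_j)/dx_b) = K_ij * (x_i - x_j)_b / l^2
    K_fg = (K[:, :, None] * diff / lengthscale ** 2).reshape(n, n * d)

    # cov(df(x_i)/dx_a, df(x_j)/dx_b)
    #   = K_ij * (delta_ab / l^2 - (x_i - x_j)_a (x_i - x_j)_b / l^4)
    outer = diff[:, :, :, None] * diff[:, :, None, :]          # (n, n, d, d)
    K_gg = K[:, :, None, None] * (np.eye(d) / lengthscale ** 2
                                  - outer / lengthscale ** 4)
    K_gg = K_gg.transpose(0, 2, 1, 3).reshape(n * d, n * d)

    full = np.vstack([np.hstack([K, K_fg]),
                      np.hstack([K_fg.T, K_gg])])

    # Independent noise on values vs. gradients: a purely diagonal add-on.
    noise = np.concatenate([np.full(n, noise_f), np.full(n * d, noise_grad)])
    return full + np.diag(noise)

# e.g. rbf_joint_cov(np.random.rand(5, 2)) returns a (15, 15) covariance matrix
```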
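
Regarding the learning-rate concern: one common workaround, not necessarily what the paper does, is to run the stochastic ascent with per-coordinate step-size adaptation (e.g. Adam) on unbiased Monte Carlo gradient estimates of the acquisition, so that a single default step size tends to carry over across BO iterations without per-iteration re-tuning. A toy sketch, assuming a hypothetical `grad_estimate(x)` that returns an unbiased estimate of the acquisition gradient:

```python
import numpy as np

def maximize_acquisition(grad_estimate, x0, n_steps=200, lr=0.05,
                         betas=(0.9, 0.999), eps=1e-8):
    """Stochastic gradient ascent on a noisy acquisition function with
    Adam-style step-size adaptation; `grad_estimate(x)` is a hypothetical
    unbiased gradient oracle (illustrative sketch only)."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)      # first-moment (mean of gradients) estimate
    v = np.zeros_like(x)      # second-moment estimate
    for t in range(1, n_steps + 1):
        g = grad_estimate(x)
        m = betas[0] * m + (1 - betas[0]) * g
        v = betas[1] * v + (1 - betas[1]) * g ** 2
        m_hat = m / (1 - betas[0] ** t)
        v_hat = v / (1 - betas[1] ** t)
        x = x + lr * m_hat / (np.sqrt(v_hat) + eps)   # ascent step
    return x

# Toy usage: maximize a noisy concave quadratic with optimum at x = (1, 1).
rng = np.random.default_rng(0)
noisy_grad = lambda x: -2.0 * (x - 1.0) + 0.1 * rng.standard_normal(x.shape)
print(maximize_acquisition(noisy_grad, x0=np.zeros(2)))
```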