Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Overall I thought this was an interesting, well-written paper with some nice ideas, and I can see the potential for impact. My main concern is that the authors don't give a good intuition for why their idea works. I suspect that there might be (1) something special about the ZINC dataset or (2) some important knowledge about chemistry and/or the dataset which was necessary for their results. While the writing was clear, I was left with a lot of questions. The most significant ones are marked with **.

Gibbs questions:
** What is the initial state for the Gibbs sampler? Did it come from the real dataset? Was only one initial state used? Does the result change if you use different initial conditions? How much burn-in was needed? I tried to find answers to these questions; apologies if I missed them.
** Figure 1 seems disjoint from the paper. It seems to describe the chemical embedding process, but it refers to (1) attention and (2) message passing, neither of which is mentioned again.

Expert input: I imagine that the results are highly dependent on the corruptor used. This seems like an important part of the paper, yet I had to read the appendix to understand it, and I was still left with many questions. Who wrote the corruptor? Were they an expert in chemistry? Can the authors expand on how the choice of corruptor impacts performance? Are there bad corruptors that don't work?

Is chemistry special? I find it surprising that the corruptor, which to me looks basically like a random walk along legal moves, provides a good match for the true distribution of chemical molecules. I would imagine that there are common tropes in chemistry which such a walk does not visit often. Perhaps the authors will suggest that it's the denoising auto-encoder which is learning about these chemical tropes, but that is especially surprising.
** Can the authors explain their success? Is there something special about the ZINC dataset that makes this possible? Would this, for example, work for generating student code-solutions (see code.org/research)?

Baselines:
** What if you just run the corruptor? How well does it do?
** If an expert is able to define generative legal moves, can they also write a generative grammar?
** Would the expert be able to write the corruptor if they didn't have access to a ground-truth dataset?
Is the generative process of chemicals Markovian in real life? The authors cite denoising auto-encoders as the motivation behind their corruption-reconstruction idea. How important was the Gibbs contribution? If you just applied the denoising auto-encoder idea directly, as suggested in that work, perhaps you wouldn't necessarily need Gibbs sampling.

Possible confounds:
** How is the frequency of molecules distributed? I imagine some molecules are more common than others. Are they Zipfian distributed? If so, the results could be dominated by a few common examples.
** What influences from ZINC could have been used to construct the synthetic dataset? Specifically, was any data from ZINC looked at by the researchers or shown to the model?
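To make the initial-state, burn-in, and corruptor-only questions above concrete, here is a minimal toy sketch of a corrupt-then-reconstruct chain. It is not the authors' method. Everything in it is an assumption made for illustration: the four-state space, the `TARGET` probabilities, the cyclic "step or stay" corruptor, and the exact-posterior reconstructor that stands in for a learned denoiser. With an exact posterior, alternating the two steps is Gibbs sampling on the joint over clean and corrupted states, so the chain should recover `TARGET` from any start, while the corruptor alone should not.

```python
import random
from collections import Counter

# Hypothetical toy "data distribution" over four discrete states.
TARGET = {0: 0.5, 1: 0.25, 2: 0.15, 3: 0.10}
STATES = sorted(TARGET)


def corruption_prob(x_tilde, x):
    """C(x_tilde | x): stay, or step to a cyclic neighbour, each with prob 1/3."""
    n = len(STATES)
    return 1 / 3 if (x_tilde - x) % n in (0, 1, n - 1) else 0.0


def corrupt(x, rng):
    """Sample a corrupted state from C(. | x)."""
    return (x + rng.choice((-1, 0, 1))) % len(STATES)


def reconstruct(x_tilde, rng):
    """Sample from the exact posterior P(x | x_tilde), proportional to
    P(x) * C(x_tilde | x). A learned denoiser would approximate this."""
    weights = [TARGET[x] * corruption_prob(x_tilde, x) for x in STATES]
    return rng.choices(STATES, weights=weights, k=1)[0]


def run_chain(initial_state, n_steps, burn_in, seed=0):
    """Alternate corruption and reconstruction, discarding the first burn_in steps."""
    rng = random.Random(seed)
    x = initial_state
    samples = []
    for step in range(n_steps):
        x = reconstruct(corrupt(x, rng), rng)
        if step >= burn_in:
            samples.append(x)
    return samples


def run_corruptor_only(initial_state, n_steps, seed=0):
    """Baseline: apply only the corruptor (a random walk along legal moves)."""
    rng = random.Random(seed)
    x = initial_state
    samples = []
    for _ in range(n_steps):
        x = corrupt(x, rng)
        samples.append(x)
    return samples


def total_variation(samples):
    """Total variation distance between the empirical distribution and TARGET."""
    counts = Counter(samples)
    n = len(samples)
    return 0.5 * sum(abs(counts[s] / n - TARGET[s]) for s in STATES)


if __name__ == "__main__":
    for init in STATES:
        tv = total_variation(run_chain(init, n_steps=20000, burn_in=1000, seed=init))
        print(f"init={init}  TV to target={tv:.3f}")
    tv_walk = total_variation(run_corruptor_only(0, n_steps=20000))
    print(f"corruptor only  TV to target={tv_walk:.3f}")
```

In this toy setting the answer to "does the initial state matter?" is no, because the chain mixes quickly. Whether the same holds on molecular graphs, where mixing could be far slower, is exactly what I would like the authors to report.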
- The method itself is well-motivated and generally makes sense. The problem of generating discrete objects is clearly a very challenging topic, and the paper proposes a simple and reasonable solution, which could be of interest to the community.
- The authors clearly indicate the main limitations of their approach.
- The paper is well-written and is easy to follow.
originality: Using the denoising (for some reason my autocorrect always wants me to write "demonising") autoencoder framework to model structured object construction as a Markov chain where transitions are local edits is, in my opinion, well justified and an interesting alternative to the previously described models.

clarity:
- I found this paper quite hard to understand (my background is not statistics). For a general ML conference like NeurIPS, I would suggest writing the paper in a more self-contained manner. I had to go to the preceding, cited work (Bengio et al 2013) to get a better idea of what the authors did in this paper.

quality:
+ The disadvantage of the model (expensive sampling) is stated.
- The authors yet again introduce another benchmark for molecule generation. This reviewer has now reviewed 15 papers on molecule generation for NeurIPS, ICML, and ICLR. 14 of them introduce new benchmarks instead of reusing already established ones created by domain experts. The authors should be required to use the established GuacaMol benchmark for molecule generation: https://github.com/BenevolentAI/guacamol
- Machine learning has made the most progress when a single benchmark was used in many papers (ImageNet!!!).
- The CGVAE paper (Liu, Allamanis, Brockschmidt, Gaunt, NeurIPS 2018) and the DeepGAR paper indicate that a simple autoregressive LSTM model on SMILES strings can outperform those models (even though the SMILES LSTM is probably the most boring model in the world). It should therefore be added as a baseline, regardless of the outcome. It may also be interesting to look at the reviews of the Liu et al paper: https://media.nips.cc/nipsbooks/nipspapers/paper_files/nips31/reviews/4855.html . An implementation of the SMILES LSTM can be found here: https://github.com/BenevolentAI/guacamol_baselines/tree/master/smiles_lstm_hc . This model was first reported in https://arxiv.org/abs/1701.01329

Questions: Could the authors point out how the model can be used for structured object optimisation, that is, finding objects with optimal properties?

_______________________
Added after the authors provided their rebuttal: Thanks to the authors for addressing the questions. I have adjusted my score to 8 and would vote for acceptance, on the condition that the authors add the results of the GuacaMol benchmark and baselines, regardless of the outcome (as they wrote in the reply), and add a comment in the outlook on how to employ the model for optimisation tasks.