Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
============ Update: In response to the authors' response, I've raised my score to 8. ============ Significance: The paper achieves a generative model capable of generating samples comparable to GANs, without a difficult adversarial training process or massive amounts of compute and data. I think that the potential impact of such a thing is very large. Quality: The paper makes a careful analysis of problems with score matching for generative modeling. The claims are specific, and each claim is backed up with a carefully done toy experiment. The method is well-motivated and few choices seem ad-hoc. The image generation experiments aren't fantastic, but they're more than sufficient to support the claim that the model performs comparably to GANs and better than any other non-GAN-related method I've seen (I'm counting MoLM and https://arxiv.org/abs/1903.08689 both as GAN-related methods). Originality: I don't have a deep background on score matching or energy models, but to my knowledge the ideas presented are novel. Even if they weren't, I would argue that putting them together into an algorithm which achieves competitive performance is a novel contribution in its own right. Clarity: The paper is well-written, with enough background and detail to be easily understood. I do have substantial criticisms which I think should be addressed before this paper is published (see the 'improvements' section of the review), but if these are addressed I recommend acceptance.
UPDATE: the authors have taken care to address all of my concerns in their rebuttal. Please ensure that they are also incorporated in the camera-ready version of the paper. The experiments on CelebA 64x64 are a welcome addition, although I would argue that this dataset isn't necessarily more challenging: the higher resolution is compensated for by simpler image content (carefully aligned faces). The fact that the cost of sampling does not increase is very promising though. I look forward to seeing this method being scaled up further to the level of adversarial and likelihood-based models in the future. I would also encourage the authors to try out varying the noise level continuously. The model could be conditioned on the logarithm of the variance of the noise, for example. In light of the quality of the rebuttal I have raised my score further. --------- The paper describes an alternative paradigm for generative modelling using score function estimation. Although several previous works have used estimators based on score functions to train generative models, these models usually try to capture the (unnormalised) density. Here, the score function is modelled directly instead. Sampling from such a model can be done using Langevin dynamics. As far as I know, previous attempts at modelling score functions directly have not been successful. It is worth noting that Saremi et al. attempted something similar in "Deep Energy Estimator Networks" (also co-authored by Aapo Hyvarinen), but report that their experiments failed (top of page 7 in the arXiv version of their paper). I think it constitutes relevant related work and I would appreciate a comment from the authors regarding Saremi et al.'s discussion about modelling the score function vs. the energy function, as they seem to imply that modelling the score function directly is not feasible, whereas this work clearly demonstrates that it is.
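To make the sampling step mentioned above concrete, here is a minimal sketch of unadjusted Langevin dynamics driven only by a score function. It is not the paper's implementation: the target is a hypothetical 1-D Gaussian whose score is known in closed form (standing in for a learned score network), and the names `gaussian_score` and `langevin_sample`, the step size, and the number of steps are all assumptions chosen for illustration.

```python
import math
import random


def gaussian_score(x, mean=2.0, std=0.5):
    """Score (gradient of the log-density) of a 1-D Gaussian N(mean, std^2).

    Assumption: this closed-form score stands in for a learned score model.
    """
    return -(x - mean) / (std ** 2)


def langevin_sample(score_fn, x0, step_size=0.01, n_steps=1000, rng=None):
    """Run unadjusted Langevin dynamics from x0 and return the final state."""
    rng = rng or random.Random()
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # Drift along the score, plus Gaussian noise scaled by sqrt(step_size).
        x = x + 0.5 * step_size * score_fn(x) + math.sqrt(step_size) * noise
    return x


# Start each chain from a random point, far from the mode on average.
rng = random.Random(0)
samples = [
    langevin_sample(gaussian_score, x0=rng.uniform(-5.0, 5.0), rng=rng)
    for _ in range(500)
]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

With a small step size the chains should settle around the target mean and variance regardless of where they start. Note that only the score is evaluated, never the (normalised or unnormalised) density, which is the property the review highlights.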
The authors observe that score estimation is difficult in low-density regions, which poses a problem for sampling through Langevin dynamics, which is likely to have a low-density starting point. They propose adding random noise to the data and training a single noise-level-conditional model that can capture the score at different levels of noise. In some sense, this parameter sharing across noise levels allows the score estimates at different noise levels to regularise each other. I think this is a clever strategy to tackle this issue: although it means that we now need to anneal the noise level during sampling, the Langevin dynamics sampling procedure was already iterative anyway, so this doesn't actually complicate matters in terms of computational expense. One thing was not entirely clear to me though: in practice, a number of discrete noise levels are chosen, and the model is only trained for those noise levels (and not the levels in between). Since this is a continuous parameter, why not vary it continuously, at least during training? It would be helpful to clarify the motivation for the discrete strategy in the paper as well. Apart from this one point, the paper is very clearly written and easy to follow.
Other comments:
- line 33: the statement "we explore a new principle for generative modelling" confused me a bit as I was thinking of score matching, which arguably isn't new. More clearly distinguishing the proposed method from previous uses of score matching for generative modelling in the literature could clarify things at this point.
- line 115: this mention of "covariant derivatives" was also confusing, and I had to go look up what they actually are. It could be helpful to add a very brief explanation in a footnote or in parentheses, or just remove the mention altogether as the concept is not used in the rest of the paper.
- line 237: I don't think "higher fidelity" is appropriate or justifiable here, and "comparable fidelity" suffices. The samples definitely don't look noticeably better than some that I've seen before from adversarial / likelihood-based models. If the authors want to make this point, it would be better to compare with samples from state-of-the-art models side by side in the figures.
- line 243: this is a really nice result, as it demonstrates the necessity and effectiveness of the proposed noise-conditional modelling approach. It is unfortunate that it has been relegated to the appendix, but I understand that space constraints probably led to this decision. If there is room in the camera-ready version, I think it would be great to have this result in the main paper.
- line 265: This statement could be a bit misleading as both likelihood-based and adversarial models have been scaled up far beyond 32x32 images, and "high-fidelity" tends to imply higher resolutions than this.
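The noise-annealing idea discussed in this review can be sketched on a toy problem. This is an illustration under stated assumptions, not the authors' code: the data distribution is a hypothetical two-mode 1-D Gaussian mixture, the noise-conditional score is computed exactly instead of being learned, and the geometric noise schedule, the step size scaled in proportion to sigma^2, and all names (`perturbed_score`, `geometric_sigmas`, `annealed_langevin`) are choices made for this sketch.

```python
import math
import random

MEANS = (-4.0, 4.0)  # hypothetical toy data: two equally weighted modes
DATA_STD = 0.3       # hypothetical within-mode standard deviation


def perturbed_score(x, sigma):
    """Exact score of the toy mixture after adding N(0, sigma^2) noise.

    Assumption: stands in for a noise-conditional score network s(x, sigma).
    """
    var = DATA_STD ** 2 + sigma ** 2
    # Log-density of each component up to a shared constant (equal weights).
    logits = [-(x - m) ** 2 / (2.0 * var) for m in MEANS]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return sum(w / total * (-(x - m) / var) for w, m in zip(weights, MEANS))


def geometric_sigmas(sigma_max, sigma_min, n_levels):
    """Discrete noise levels in a geometric sequence, largest first."""
    ratio = (sigma_min / sigma_max) ** (1.0 / (n_levels - 1))
    return [sigma_max * ratio ** i for i in range(n_levels)]


def annealed_langevin(score_fn, sigmas, x0, base_step=2e-4,
                      steps_per_level=100, rng=None):
    """Langevin dynamics run at each noise level in turn, largest first."""
    rng = rng or random.Random()
    x = x0
    for sigma in sigmas:
        # Assumption of this sketch: step size proportional to sigma^2.
        step = base_step * (sigma / sigmas[-1]) ** 2
        for _ in range(steps_per_level):
            noise = rng.gauss(0.0, 1.0)
            x = x + 0.5 * step * score_fn(x, sigma) + math.sqrt(step) * noise
    return x


rng = random.Random(0)
sigmas = geometric_sigmas(3.0, 0.05, 10)
samples = [
    annealed_langevin(perturbed_score, sigmas, rng.uniform(-8.0, 8.0), rng=rng)
    for _ in range(200)
]
frac_right_mode = sum(x > 0 for x in samples) / len(samples)
```

At the largest noise level the two modes overlap enough for chains to move between them; as sigma shrinks, the chains are refined towards the individual modes. The total number of Langevin steps is fixed by the schedule, which matches the observation above that annealing does not change the iterative character of sampling.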
The paper discusses a new learning principle of score matching in the context of generative models. While score matching is a pretty classical idea, the paper nicely demonstrates its power on large-scale generative models and addresses the unique challenges posed by modern images. I have only one major comment. It is nice to see how score matching is able to generate non-blurry image examples. However, this is counterintuitive, specifically because the learning algorithm adds Gaussian noise: Gaussian noise in VAEs is known to be the culprit of image blurriness. On the other hand, https://arxiv.org/abs/1903.05789 has shown that if we get the dimensionality of the manifold correct, we are also able to get non-blurry images. So I wonder how to reconcile the non-blurry generated images with the presence of Gaussian noise. Or are the generated images non-blurry because the method gets the manifold dimension correct? If so, is the gain of the proposed algorithm coming from the new learning principle or from getting the manifold dimension correct?