NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:2410
Title:Training Language GANs from Scratch

Reviewer 1

- the text in the intro claims that MLE language models cannot perform one-shot language generation. this statement isn't exactly true; there has been a lot of recent work on non-autoregressive generation from conditional LMs (mostly within machine translation) that could be cited as a counterpoint here (e.g., Gu et al., ICLR 2018). this "contribution" should thus be toned down.
- many typos throughout (e.g., "World level perplexity" in table 2, "model ground through" in line 244)
- the evaluation results leave me unsatisfied. while ScratchGAN does seem better than other GAN-based alternatives, the perplexity difference between it and the MLE model on wikitext103 is immense. The authors attempt to explain this difference in lines 245-250, but I didn't quite catch the drift of their argument (isn't it bad that ScratchGAN does not favor diversity during training?). However, looking at the generated samples from the MLE model and ScratchGAN in the appendix, it is clear that the huge perplexity difference actually corresponds to noticeable differences in grammaticality and coherence.
- Why are there so few samples provided with the ScratchGAN after training? The supplementary material should have way more samples from each model so we can judge their relative quality, especially since the evaluation metrics used here are (outside of perplexity) hard to judge.
- From my perspective, it is a stretch to say that ScratchGAN performs "comparably" to MLE trained models.

Reviewer 2

Originality: There's moderate novelty in the methodological contribution mentioned above. There's a nice discussion of related work on language GANs, but this field moves fast and a couple of new papers are not mentioned.

Quality: Besides the methodological contribution, the paper does a really good job of evaluating language GANs with many metrics that measure the diversity and quality of generated sentences. The results look great, and I always appreciate a good ablation study, which the authors provide, with great graphs visualizing the additional contribution of each technique (Fig 3).

Clarity: The paper is quite well written. The experimental details are provided in the supplementary materials, which helps with reproducibility. I wish there were an anonymous link to the code, however. (Why not?)

Significance: This paper certainly explores a gap in how to train GANs for natural text, which is an interesting direction. I'm still not entirely convinced about the superiority of language GANs over language models for text generation. As far as I know, GANs are still substantially worse than LMs (Tevet et al., 2019). In addition, can language GANs really scale to large neural nets, such as the Transformer? The balance of power between the discriminator and the generator is quite delicate, and it becomes progressively harder to tune as the discriminator and the generator become more powerful. There's hope, however (for example, BigGANs for images), but this is still somewhat uncharted territory for GANs. Can you please address the scalability? I am glad of the improvements made so far in using GANs for language, however.

Reviewer 3

[EDIT after author rebuttal]: Thank you very much for the rebuttal; it helped clarify some of the issues, especially those regarding comparison against MLE and potential overclaiming. I've raised my score accordingly, but I still think that the results need to be more solid. In particular, while the rebuttal notes that ScratchGAN can almost match the MLE baseline, I am not sure how strong the MLE baseline itself is. Based on sample quality, I suspect that the MLE baseline is quite weak and does not use more modern LM approaches (e.g. regularization). Of course, I am not saying that the authors deliberately used weak baselines, but it would be helpful to compare against stronger MLE baselines too.
--------------
Strengths:
- Isolating the sources of contribution was nice to see, although it would also have been nice to see this on metrics other than FID.
- I appreciate the negative results in Supplemental section C.
- In general the paper was very well written and easy to read/understand.
Weaknesses:
- The main weakness is empirical: ScratchGAN appreciably underperforms an MLE model in terms of LM score and reverse LM score. Further, samples from Table 7 are ungrammatical and incoherent, especially when compared to the (relatively) coherent MLE samples.
- I find this statement in the supplemental section D.4 questionable: "Interestingly, we found that smaller architectures are necessary for LM compared to the GAN model, in order to avoid overfitting". This is not at all the case in my experience (e.g. Zaremba et al. 2014 train 1500-dimensional LSTMs on PTB!), which suggests that the baseline models are not properly regularized. D.4 mentions that dropout is applied to the embeddings. Is it also applied to the hidden states?
- There is no comparison against existing text GANs, many of which have open source implementations. While SeqGAN is mentioned, the authors do not test it with the pretrained version.
- Some natural ablation studies are missing: e.g. how does ScratchGAN do if you *do* pretrain? This seems like a crucial baseline to have, especially since the central argument against pretraining is that MLE-pretraining ultimately results in models that are not too far from the original model.
Minor comments and questions:
- Note that since ScratchGAN still uses pretrained embeddings, it is not truly trained from "scratch". (Though Figure 3 makes it clear that pretrained embeddings have little impact.)
- I think the authors risk overclaiming when they write "Existing language GANs... have shown little to no performance improvements over traditional language models", when it is clear that ScratchGAN underperforms a language model across various metrics (e.g. reverse LM).