NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 1146
Title: Improved Techniques for Training GANs

Reviewer 1

Summary

The authors provide a bag of tricks for training GANs in the image domain. Using these, they achieve very strong semi-supervised results on SVHN, MNIST, and CIFAR-10.

Qualitative Assessment

This is a strong paper and it should be accepted. This is the first work to get strong quantitative results using GANs, and indeed, the semi-supervised learning results are the best I have seen reported. On the negative side, the work is peppered with strong statements that are not necessarily supported by the results, e.g. "The reason appears that the human ..." on line 259. But all in all, this paper should be accepted.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 2

Summary

The paper presents a set of techniques to improve the optimization of generative adversarial networks, and also proposes several ways of evaluating the performance of these adversarial networks. The authors then train the improved model on several image datasets, evaluate it on different tasks (semi-supervised learning and generative capabilities), and achieve state-of-the-art results.

Qualitative Assessment

The most impressive part of the paper is the experiments section, which demonstrates the potential of the adversarial framework for machine learning. First, the authors propose an interesting way of incorporating a generative model into the discriminative training objective. They achieve excellent classification performance with few labels, outperforming all models in the benchmark, including the ladder network, which was also shown to excel at this task. Also, the generated examples on ImageNet are very convincing and have the necessary visual properties of high-quality generated samples (realistic looking, but not too close to the data points). The section where the technical improvements are discussed is less clear, in particular Section 3.2, where some variables are not defined (e.g. B and C). Figure 1 is hard to read.
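For reference, the reviewer's best guess at the computation in Section 3.2 is sketched below (a rough numpy reconstruction, not the authors' code). Under this reading, B and C are the row and column dimensions of the per-example matrix produced by the learned tensor T; the authors should state this explicitly.

```python
import numpy as np

def minibatch_features(f, T):
    """Reviewer's reading of the minibatch discrimination layer (Sec. 3.2).

    f : (n, A) array of intermediate discriminator features f(x_i).
    T : (A, B, C) learned tensor; B and C are presumably the dimensions of
        the per-example matrices M_i = f(x_i) . T.
    Returns (n, B) batch-level features o(x_i), concatenated onto f(x_i).
    """
    M = np.tensordot(f, T, axes=([1], [0]))                          # (n, B, C)
    # L1 distance between rows M_{i,b} and M_{j,b} for every pair (i, j).
    diff = np.abs(M[:, None, :, :] - M[None, :, :, :]).sum(axis=3)   # (n, n, B)
    c = np.exp(-diff)                                                # c_b(x_i, x_j)
    return c.sum(axis=1)                                             # o(x_i)_b = sum_j c_b

# Toy usage: 4 examples, A=5 input features, B=3 output features, C=2.
rng = np.random.default_rng(0)
o = minibatch_features(rng.normal(size=(4, 5)), rng.normal(size=(5, 3, 2)))
print(o.shape)  # (4, 3)
```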

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 3

Summary

This paper investigates several techniques to stabilize GAN training and encourage convergence. Although they lack theoretical justification, the proposed heuristic techniques give better-looking samples. In addition to human judgement, the paper proposes a new metric called the Inception score, obtained by applying a pre-trained deep classification network to the generated samples. By treating the generated samples as an additional, freely labelled category, the paper also applies GANs to the semi-supervised learning setting, achieving state-of-the-art semi-supervised performance on several benchmark datasets (MNIST, CIFAR-10, and SVHN).
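For concreteness, the Inception score as the reviewer understands it is exp(E_x KL(p(y|x) || p(y))) over generated samples x, with the class posteriors coming from a pre-trained Inception classifier. A minimal numpy sketch of that reading (the reviewer's own, with a generic classifier output standing in for the Inception network):

```python
import numpy as np

def inception_score(p_yx, eps=1e-12):
    """exp( E_x [ KL( p(y|x) || p(y) ) ] ), the reviewer's reading of the metric.

    p_yx : (n, k) array of class posteriors p(y|x) for n generated samples,
           as produced by a pre-trained classifier (the Inception network in
           the paper; any classifier works for this sketch).
    """
    p_y = p_yx.mean(axis=0, keepdims=True)                 # marginal p(y)
    kl = (p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Confident and diverse predictions (one sample per class) score near the
# number of classes, which is the maximum for this toy case.
print(inception_score(np.eye(10)))  # ~10.0
```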

Qualitative Assessment

The results presented in the paper are impressive and significant. However, they are quite empirical, non-conclusive, and lack theoretical justification. For the rebuttal, please focus on answering the points marked (*), (**), and (***) in the following paragraphs. The reviewer is willing to change the score if all the questions are well addressed.

Novelty: The techniques proposed in the paper are novel in general. However, the proposed "feature matching" technique for training GANs has been explored to some extent (a sketch of the reviewer's reading of this objective is given at the end of these comments):
-- Generating Images with Perceptual Similarity Metrics based on Deep Networks, by Dosovitskiy and Brox
-- Autoencoding beyond pixels using a learned similarity metric, by Larsen et al.
The "mini-batch discrimination" is interesting.

Clarity: The paper is clearly written, but the discussion of the experimental results is not conclusive and fails to provide high-level insights.

Technical quality:
-- Regarding convergence: although the paper shows better-looking samples and state-of-the-art performance in semi-supervised learning, there is no clear evidence of whether the proposed methods encourage convergence. It would be much more convincing if the paper compared the learning curves with naive GAN training. (*) For the rebuttal, please comment on convergence.
-- Quality of samples: it is not known whether the better-looking samples are a sign of over-fitting (faster convergence may lead to over-fitting). As mentioned in the paper, "mini-batch discrimination" fails to give better performance in semi-supervised learning. (**) For the rebuttal, please resolve this issue and make a conclusive judgement about the proposed techniques.
-- Semi-supervised learning: the proposed semi-supervised framework (starting from line 219) is somewhat misleading. On one hand, samples generated by the GAN can help to regularize training when little labelled data is available. On the other hand, performance will suffer if GAN training converges, since the samples may resemble some "real" categories while the assigned labels are different. (***) For the rebuttal, please comment on the performance of different variants of the GAN (e.g., with or without "feature matching") on semi-supervised learning.

Usefulness:
-- In terms of application, it makes little difference if one can generate better-looking but still unrealistic images (see Figure 6) on ImageNet. The reviewer suggests evaluating the performance under some constraints (e.g., face images or indoor scene images, as done in [2] and [3]). Face editing is another interesting experiment now that GANs have been improved (see "Autoencoding beyond pixels using a learned similarity metric").

Detailed comments:
-- Line 226 seems inconsistent with Line 227.
-- Line 230
-- Figure 4: it would be better to provide the nearest training examples alongside the visualized generated samples.
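For reference, the "feature matching" objective discussed under Novelty trains the generator to match the expected intermediate discriminator activations of real and generated data. A rough numpy sketch of the reviewer's reading (not the authors' code, and with arbitrary feature dimensions):

```python
import numpy as np

def feature_matching_loss(f_real, f_fake):
    """|| E_x f(x) - E_z f(G(z)) ||_2^2 over a minibatch, where f(.) denotes
    activations of an intermediate discriminator layer. In the reviewer's
    reading, this replaces the usual generator objective; only G is updated
    with this loss."""
    return float(np.sum((f_real.mean(axis=0) - f_fake.mean(axis=0)) ** 2))

# Toy usage with random "activations" standing in for a real discriminator.
rng = np.random.default_rng(0)
print(feature_matching_loss(rng.normal(size=(64, 128)),
                            rng.normal(size=(64, 128))))
```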

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 4

Summary

This paper presents several techniques for improving the stability of GAN training and shows how adversarial methods can be used to achieve state-of-the-art performance in a semi-supervised setting. The authors present empirical results supporting the value of their approach.

Qualitative Assessment

The techniques described in this paper, both for improving GAN training and for applying adversarial methods in the semi-supervised setting, seem likely to be reused by other researchers. These topics are very active, the techniques are straightforward to implement, and the results are promising.

A strength of the proposed approach to semi-supervised learning, which isn't mentioned in the paper, is that it avoids an expensive marginalization over classes that's required by competing VAE-based methods.

I don't quite like the statement "nor do we require the model to be able to learn well without using any labels" from the abstract. This seems like a bit of tricky marketing: it's somehow trying to sell the requirement of additional information during training as a benefit rather than a downside. Additionally, a number of papers have previously noted that incorporating label information while training a generative model can improve its qualitative performance. The authors should make it clearer that their best qualitative results rely on a large amount of labelled data, and are thus unobtainable in the unsupervised setting.

The authors should mention in their SVHN results that the methods they're competing against didn't use conv nets (as far as I know). Figure 2 should be made a bit larger if possible, so zooming in isn't strictly necessary. Sample sizes for the MTurk experiments would be appreciated. Results on 64x64 ImageNet and/or LSUN data would be interesting to see; 128x128 is clearly still beyond current models, but at 64x64 something more convincing might pop out.

It might be worth pointing out that the proposed feature matching technique is literally MMD with a linear kernel in an adversarially-learned feature space, rather than just being "similar in spirit". Distribution matching in an adversarial feature space was previously proposed by (Lindbo Larsen et al., 2016), albeit using variational KL minimization rather than MMD.

The proposed method for extracting batch-level features seems somewhat arbitrary. It would be helpful if the authors could provide some intuitive motivation for their choices, or describe a few of their attempts at constructing suitable features that didn't work out.

One reason the proposed approach to semi-supervised learning might perform so well is that adversarial training forces the classifier to squash out the low-density regions separating the classes on the data manifold. These regions are under-constrained in the fully-supervised setting, as they don't provide any data during training. In the semi-supervised setting, the classifier/discriminator is forced to assign these regions to the "generated" class, which may provide stronger constraints on the class boundaries in the real data.
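To spell out the mechanism I have in mind: as I read the paper, the discriminator is a (K+1)-way classifier whose extra class means "generated", and the usual real/fake probability is recovered as D(x) = 1 - p(y = K+1 | x). A rough sketch of that parameterization (my own reconstruction, not the authors' code):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def discriminator_outputs(logits):
    """Semi-supervised GAN head as I understand it: K+1 logits per example,
    with class K+1 meaning "generated". Returns the posterior over the K real
    classes and the implied real/fake probability D(x)."""
    p = softmax(logits)                                       # (n, K+1)
    d = 1.0 - p[:, -1]                                        # D(x) = 1 - p(y=K+1|x)
    p_classes = p[:, :-1] / (p[:, :-1].sum(axis=1, keepdims=True) + 1e-12)
    return p_classes, d

# Toy usage: 4 examples, K = 10 real classes.
rng = np.random.default_rng(0)
p_classes, d = discriminator_outputs(rng.normal(size=(4, 11)))
print(p_classes.shape, d.shape)                               # (4, 10) (4,)
```

Under this view, off-manifold inputs that the data provides no labels for still receive gradient pressure, because the unsupervised term pushes them towards the K+1-th output.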

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 5

Summary

This paper presents a collection of practical techniques for improving the training of GAN models and the quality of the samples from the learned models. Most of the techniques are motivated heuristically, but they seem to help, judging from the experimental results. A new application of GAN models to semi-supervised learning is also proposed, and the proposed approach achieves great results on MNIST, CIFAR-10 and SVHN semi-supervised classification.

Qualitative Assessment

This paper presents a collection of techniques for improving GAN training. Most of them are motivated heuristically, and the results are evaluated empirically. Taken together, these techniques are effective, and GAN models trained with them are able to generate very good quality samples on MNIST, CIFAR-10 and SVHN.

I found the semi-supervised learning approach very interesting, and I think it should make the samples look more like examples from the classes, thus improving perceptual sample quality. This is a natural result of optimizing the loss presented in section 5 (my reading of that loss is sketched at the end of these comments). A big part of the improvement in sample quality may actually come from this semi-supervised learning; however, this is not verified very thoroughly. In Table 3, the "Our methods" vs "-L" comparison seems to be verifying the use of semi-supervised learning vs not, if I understand the meaning of "L" correctly. However, the sample quality of the "-L" model seems rather bad, and a reasonably well-trained standard GAN model should do a little better; at the very least, the obvious artifacts should be removable. On the other hand, the experimental results show that a good generative model of images does not help that much for learning classifiers in the semi-supervised setting, while good classifiers learned in a semi-supervised way do not help train good generative models. So clearly something not well understood is going on, and our intuitions might be incorrect or incomplete.

On a separate note, having achieved state-of-the-art semi-supervised learning results is nice, but from the paper it is hard to see whether the methods are directly comparable. It would be good to have more careful control over the experimental setup, for example by making sure the classifier networks have the same architecture as previous methods, and then demonstrating the benefit of the proposed semi-supervised training procedure. Otherwise the picture is mixed.

The proposed Inception score metric makes intuitive sense. However, a bad generative model that generates nothing like any of the classes, and thus has p(y|x=G(z)) distributed equally across all classes, seems able to fool this metric. It probably makes sense to use this metric to compare models of a similar nature, but when comparing models of different natures, for example GANs vs. VAEs, the results may not be very meaningful, as the behavior of this metric is not well understood.

The formulation of the minibatch discrimination features is a bit arbitrary to me. I wonder if there is any reason behind choosing this particular formulation; presumably there are other ways to introduce minibatch features.

Finally, the paper proposes the techniques with the goal of making GAN training more stable, but none of the experiments actually show that GAN training is more "stable" as a result of these techniques. The results mostly focus on sample quality and semi-supervised learning, and do not touch on any measure of "stability". Maybe it would be better to change the phrasing a little, or, more interestingly, to show some results demonstrating that these techniques actually improve "stability" under some understandable measure, even a qualitative one.

Overall, I like the results of this paper; some of the proposed techniques are novel, and the semi-supervised learning approach is also interesting. There are a lot of unanswered questions, but some of them may be left to future work.
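For reference, the section 5 loss I refer to above splits, as I read it, into a supervised cross-entropy term on labelled data and an unsupervised GAN term on unlabelled and generated data, which is why I would expect it to push samples towards the real classes. A rough numpy sketch under that reading (my own, not the authors' code):

```python
import numpy as np

def semi_supervised_loss(p_label, p_fake_on_real, p_fake_on_gen, eps=1e-12):
    """My reading of the Section 5 objective for the discriminator/classifier.

    p_label       : (n_l,) probabilities p(y = correct label | x) on labelled data.
    p_fake_on_real: (n_u,) p(y = K+1 | x) on unlabelled real data.
    p_fake_on_gen : (n_g,) p(y = K+1 | x) on generated data.
    """
    l_supervised = -np.log(p_label + eps).mean()
    l_unsupervised = -(np.log(1.0 - p_fake_on_real + eps).mean()
                       + np.log(p_fake_on_gen + eps).mean())
    return float(l_supervised + l_unsupervised)

# Toy usage with made-up probabilities.
print(semi_supervised_loss(np.array([0.9, 0.8]),
                           np.array([0.1, 0.2]),
                           np.array([0.7, 0.6])))
```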

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)