NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:2302
Title:Flow-based Image-to-Image Translation with Feature Disentanglement

Reviewer 1

Original: the paper is original in the way it uses flow-based generative models for one-to-many image-to-image translation. The recent paper [21] also uses conditional flow-based models, although there are differences.

Quality: The proposed approach is technically sound and supported by good results in the experimental section. However, it would have been nice to include the work of [21] in the comparisons. Was this omitted because that paper is so recent? Also, the way conditions were introduced into the CelebA dataset seems quite artificial: if only one pair of images was selected as prototypes for smiling and not smiling, there may be many other differences between them that the model picks up, not only the smile. Which network was used to compute the FID on the CHC dataset? One last point in this regard: \mu(c)_i is supposed to be 0 when M_i is 0, but this only holds if b_\mu is also 0 (same for \sigma). Are the bias terms not used in practice?

Clarity: the paper is overall well written, and the loss functions are well motivated. Some parts could be made clearer; in particular, the discussion of Eqs. 1 and 2 should refer to Fig. 2c. I would make Fig. 2c a figure of its own, as in my opinion it is the diagram that provides the most information about the approach.

Significance: results look clearly better than those of competitors, although in an unconventional set of experiments. Still, the paper may become a reference for other researchers working on similar problems. It also provides a new instance of the problem (CHC), which may be useful for the community.

Edit after rebuttal: Having read the other reviews, and given that the rebuttal addressed my concerns (except the one regarding the use of Inception-V3 to calculate FID for CHC, as CHC images do not look anything like the natural images used for pretraining Inception-V3), I agree this paper should be published, and hence I raise my assessment from 6 to 7.
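To make the bias question concrete, here is a minimal sketch of the concern. It assumes, purely for illustration, that the mean is an elementwise affine function of masked features, mu_i = w_i * (M_i * h_i) + b_i; the function name `masked_affine` and all numbers are hypothetical and are not taken from the paper.

```python
def masked_affine(h, mask, weight, bias):
    """Assumed form (not the paper's): mu_i = weight_i * (mask_i * h_i) + bias_i."""
    return [w * (m * x) + b for x, m, w, b in zip(h, mask, weight, bias)]


# Hypothetical features; the mask switches off entries 1 and 2.
h = [0.7, -1.2, 0.4]
mask = [1.0, 0.0, 0.0]
weight = [2.0, 2.0, 2.0]

mu_with_bias = masked_affine(h, mask, weight, bias=[0.5, 0.5, 0.0])
mu_no_bias = masked_affine(h, mask, weight, bias=[0.0, 0.0, 0.0])

# With a non-zero bias, a masked entry equals its bias rather than zero.
print("with bias:", mu_with_bias)
# Without bias terms, every masked entry is zero.
print("no bias:  ", mu_no_bias)
```

Under this assumed form, a masked entry is zero only when its bias is zero, which is the condition the question above asks the authors to confirm.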

Reviewer 2

Overall, I like the idea, and the task is also interesting.
- For the CelebA dataset, the authors choose “smile” or “not smile”. How does the model perform on other attributes, such as young/aged or blond/black hair?
- Classification accuracy for the condition could serve as an additional metric for comparison.
- As the authors say, a flow-based model such as Glow can generate high-quality images. However, the generated face images can hardly be called high-quality. Could FUNS deal with higher-resolution datasets such as CelebA-HQ? If not, what is the main advantage of FUNS over GAN- or VAE-based I2I models?
- An ablation study is required with respect to the losses and modules.

Minor:
- Liu et al. [1] is a flow-based I2I model performing conditional image generation. Even if [1] was published after the NeurIPS deadline, I recommend revising the Introduction, because the authors claim there that their work is the first flow-based I2I model. That work is referred to in Section 3.1.
- Line 96 on page 3: merginalizing → marginalizing

[1] Liu et al. Conditional Adversarial Generative Flow for Controllable Image Synthesis. CVPR 2019.

[After rebuttal] I carefully read the other reviewers' comments and the author feedback. The authors alleviated most of my concerns. Even if the classification accuracy is not competitive, the overall performance looks promising, considering the diversity indicated by the LPIPS score. Therefore, I update my score to 7.

Reviewer 3

Flows maximize the likelihood of observed images in a latent space and are formulated invertibly, such that at test time latent space samples can be decoded into images. In their vanilla formulation, their setup does not allow conditioning on external attributes or images. This submission proposes an adaptation of flows for image-to-image translation problems and as such is an interesting and original work.

The paper is overall well structured. The general idea is somewhat hard to understand until Section 3.3 / 3.4. This is in part due to the complex nature of Figures 2a)-2c), which don’t aid much in clarifying the idea that underpins the model. It would be beneficial to state the basic idea in simple terms early on in the paper. I would have liked to read something like ‘The model consists of a U-Net with multiple scales of latent variables between its encoder and its decoder. These latent spaces encode an image c to condition on, which is trained by reconstructing c. Simultaneously, a flow-based model is trained to associate possible outputs x with the given image c by maximizing the likelihood of the latents z that the flow produces given x under the U-Net’s latent space densities for c’.

Training the encoder-decoder is done using three terms: an L2 reconstruction term on the output space, an L1 term to encourage a sparse encoding of z|c by means of a mask that yields Normal and thus uninformative latent distributions, and an entropy term on z|c that encourages the ‘unmasked’ latent distributions to have low variance. This setup seems a bit contrived and could potentially be avoided, or done in a more principled way, by training the encoder-decoder via variational inference. For improved clarity of the training objectives, the loss terms (left-hand sides of Eqs. 4, 5 and 6) should have the parameters with respect to which they are optimized as a subscript.

It would further be interesting to have an analysis of how many of the latents end up being masked (surely a function of the weighting terms). Also, with latents at all scales, it appears the encoder has an incentive to take the easiest route and encode as much as possible in the scale with the highest resolution, which would not require finding a more abstract and semantically structured latent space. In this context, the means (\mu) of e(z|c) of individual encoder scales could be fixed to investigate what individual scales encode, in case it can be teased apart. It would generally be helpful to give the dimensions of the latent tensors at all scales, to better understand how the dimensionality d_x is split across scales.

The employed metrics test only for the fidelity and the diversity of the generated images, but do not seem to test whether they are appropriate / plausible given the image to condition on. Other datasets allow this more easily, e.g. image translation between street scene images on Cityscapes and their segmentations, which was considered by other image-to-image translation works. A way to quantify conditional plausibility on CelebA could be to pretrain a classifier to classify the ~40 attributes of each image and use this classifier to quantify whether they are conserved / altered in the generated images. This seems an important analysis given that the proposed model does not have a deterministic path from c -> x, which means there could potentially be only a very weak correlation. Additionally, the likelihood of ground truth images x under the encoder could be reported, i.e. e(f(x)|c).

The PU-Net is reported to produce blurry results, but the exact architectural setup and training procedure for this baseline (and for the VU-Net) are not stated.

There are various typos and grammatical errors, such as ‘merginalizing’, missing white spaces, wrong/missing articles (‘a’, ‘the’), ‘there is no oracle that [?] how large’, and wrong usage of colons in the datasets description.

[After Rebuttal] The rebuttal addressed the comments and questions I had raised. My score remains at 7.
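The attribute-conservation check suggested in this review could be sketched roughly as follows. This is an illustration under stated assumptions, not an evaluation protocol from the paper: the binary attribute predictions are assumed to come from some pretrained classifier applied to each conditioning image and its generated counterpart, and the function name `attribute_agreement` and the toy labels are hypothetical.

```python
def attribute_agreement(pred_cond, pred_gen):
    """Per-attribute fraction of (conditioning, generated) pairs whose
    predicted binary label is the same in both images."""
    n_pairs = len(pred_cond)
    n_attrs = len(pred_cond[0])
    rates = []
    for a in range(n_attrs):
        same = sum(1 for c, g in zip(pred_cond, pred_gen) if c[a] == g[a])
        rates.append(same / n_pairs)
    return rates


# Hypothetical classifier outputs for 4 image pairs and 3 attributes.
# Attribute 0 plays the role of the translated attribute (it should flip);
# attributes 1 and 2 should ideally be conserved.
pred_cond = [[1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 1, 0]]
pred_gen = [[0, 0, 1], [1, 0, 1], [0, 1, 0], [1, 1, 1]]

rates = attribute_agreement(pred_cond, pred_gen)
print(rates)  # [0.0, 1.0, 0.75]
```

A low agreement rate on the translated attribute together with high rates on the remaining ones would indicate that generated images stay plausible given the conditioning image; low rates across the board would point to the weak correlation this review is concerned about.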