NeurIPS 2020

Generative View Synthesis: From Single-view Semantics to Novel-view Images


Review 1

Summary and Contributions: This paper proposes a method for synthesizing novel views from a single input semantic map. The contribution of the paper is that it builds a layered representation on the semantic map and transforms each semantic layer into MPI appearance; finally, it combines the different MPI layers to synthesize novel views. The paper shows that the proposed method is better than first synthesizing the image and then predicting novel views from the synthesized image.

Strengths: 1. The idea of first lifting semantic maps to 3D and then performing the translation at each layer is interesting, and the paper shows that it produces better results than first synthesizing an image and then performing view synthesis.

Weaknesses:
1. The technical contribution is somewhat incremental to me. First, the task itself is just a combination of two tasks: image-to-image translation and novel view synthesis from a single image. While the authors show that doing the translation on layered semantic maps is better, I still find that each component of the network has been well studied in previous works. Therefore, while I don't see any technical problems, I don't find many exciting points in the paper.
2. Most of the results shown in the paper have a low resolution of 256 x 256. Such a low resolution makes the results and comparisons not that impressive.
3. In the comparisons, a more reasonable baseline would be to also feed the semantic map to the network in the second stage, when the other view synthesis methods are applied to the translated images. That way, the view synthesis network would know more about the structure of the scene.
4. The applications shown in Section 4.1 are standard for each individual task. I don't see the need to emphasize them in the paper.
To summarize, I don't think the paper has enough contribution for acceptance to NeurIPS.

Correctness: The paper is technically sound.

Clarity: The paper is well written.

Relation to Prior Work: The references are good.

Reproducibility: Yes

Additional Feedback: Post-rebuttal: The additional comparisons in the rebuttal address my concerns, and I agree with accepting the paper.


Review 2

Summary and Contributions: The paper discusses a method for generating novel views conditioned on a semantic map, similar to image-to-image translation methods. It first lifts the semantics to a three-layer semantic image representation and obtains a transparency pyramid. Then, from the semantics, the layered appearance images are generated. Finally, the appearance is projected onto the target view and processed by a small network to obtain the final image. There are no state-of-the-art methods that exactly match what the paper proposes; therefore, the authors compare with obvious baselines only.

Strengths:
1.1. The paper proposes a novel task as far as I can tell: view-conditioned image-to-image translation. The cost of this extension is the need for stereo data.
1.2. To solve the new task, the authors address several issues arising in this extension, such as specialized networks, memory efficiency, etc.
1.3. Comparisons and ablations seem to support the method.

Weaknesses:
2.1. The task. Although the task seems interesting, I believe the paper would benefit from a deeper discussion of the benefits that view-conditioned i2i can bring. Section 4.1 provides some, such as semantic editing and semantics-to-depth. While the former is exactly what i2i offers, the benefits of the latter are not obvious. Why would we want to infer depth from semantics? Why is this useful? I also think it is unfair to compare monodepth and the present work (as done in Fig. 7), since monodepth has to estimate object boundaries, while the present work gets them as input.
2.2. Lifted semantics. Different layers of the lifted semantics do not seem to model a layered representation of the scene. If we look at Fig. 3 (middle), we see that the fence disappears only at L3. In this example, the fence should intersect all the planes, and different layers should contain parts of it. Therefore, I believe this is not truly MPI-style work, but some sort of semantics decomposition or occlusion removal.
2.3. Applications. Can the method be applied to anything except street scenes? SPADE, for example, can be applied to dancing people, surfing, scenery, etc. Can the present method be applied to such settings? I assume it would require stereo or multi-view data, but I was wondering whether the authors have considered something like that.
2.4. Baselines and experiments. Given the nature of the problem, I believe the authors could consider an additional comparison. For synthetic data, such as Virtual KITTI, one can actually generate a semantic map and an image for the novel view. The latter can be used to run some sort of reconstruction loss. The former can be used by traditional i2i frameworks to generate the output image for comparison with the current method.
2.5. Evaluation metrics. In lines 218-222 the authors mention that a generative view synthesis system should have two properties: (1) semantic preservation and (2) photorealism. I respectfully disagree and argue that these are the properties of a classical i2i system. Instead, I believe view synthesis methods should show view consistency, i.e., be able to reliably reconstruct novel views given input images. This has to be done numerically rather than visually as the present paper offers (a minimal sketch of such a check follows this list). As far as I can tell, the paper did not provide a quantitative evaluation of view consistency/reconstruction. Can the presented method generate better images than the obvious baselines? Yes. Are the novel views consistent and plausible given the input semantics? The answer is not given.
2.6. Two-stage system. I believe a one-stage system would be more interesting, as staging adds another set of hyperparameters, such as when to stage, with which learning rate, etc. These are usually application specific and hence increase the complexity of the system without providing anything useful. The authors mention that staging is due to their hardware limitations. I believe this is a weak argument for not evaluating the end-to-end system.
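To make 2.5 concrete, the kind of numerical check I have in mind is a simple reconstruction score against a held-out target view (e.g. the second stereo camera, or a rendered Virtual KITTI frame). A minimal sketch, assuming a hypothetical model interface with a render_novel_view method and a dataset providing ground-truth target images:

    import numpy as np

    def psnr(pred, target, max_val=1.0):
        # Peak signal-to-noise ratio between two images scaled to [0, max_val].
        mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
        return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

    def view_reconstruction_score(model, dataset):
        # Average PSNR of rendered novel views against held-out target images.
        # `model` and the dataset fields are assumed interfaces, not the authors' API.
        scores = [psnr(model.render_novel_view(s.semantics, s.target_pose),
                       s.target_image)
                  for s in dataset]
        return float(np.mean(scores))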

Correctness: I believe that to fully support the claims, the weaknesses have to be addressed, especially 2.1, 2.3, and 2.5.

Clarity: The paper is very well written, is easy to read and comprehend.

Relation to Prior Work: A couple of related works are missing:
[A] Tulsiani, Tucker, Snavely. Layer-structured 3D Scene Inference via View Synthesis. In ECCV, 2018.
[B] Nguyen-Phuoc et al. HoloGAN: Unsupervised Learning of 3D Representations from Natural Images. In ICCV, 2019.

Reproducibility: No

Additional Feedback: A note on reproducibility: I believe the paper will be hard to reproduce unless the source code is provided. In my experience, methods consisting of numerous components and training stages are very hard to re-implement correctly. I'm positive about the paper and would like to see my concerns addressed by the authors.


Review 3

Summary and Contributions: This submission presents an approach for generative view synthesis, i.e., a combination of novel view synthesis and semantic image synthesis that regresses, from a semantic map, a representation that enables novel view synthesis instead of just generating a single image. To this end, the approach combines recent advances in semantic image synthesis with advances in novel view synthesis. On the technical side, the approach is a deep network with multiple components: the combination of a novel semantic uplifting network, a layered translation network, and an appearance decoder. The results produced by this approach are of good quality. In a comparison, it is also demonstrated that a simple baseline that naively combines a recent novel view synthesis approach with a recent semantic image synthesis approach is not sufficient to achieve good quality.

Strengths: This paper defines a new problem domain that is really interesting, i.e., the generation of a representation from a semantic map that can be re-rendered to novel views. To this end, the approach proposes to lift the semantic map to a 3D-aware semantic MPI representation; this representation is then converted to a feature-based representation that can be rendered to novel views. Finally, a rendering network converts the projected feature maps into a realistic image. The presented results are of good quality, and the approach clearly outperforms the baseline approach. Comparisons to multiple baselines are performed, based on different semantic image translation and view synthesis approaches. It is clearly visible that these naive baselines struggle to produce high-quality 3D-consistent results. In addition, several applications enabled by this approach are shown, such as semantic editing and depth estimation. The results can also be conditioned on the style of another specified image.

Weaknesses: It would be good to discuss the limitations of this approach in more detail to open up the field for future work.

Correctness: Yes.

Clarity: Paper is well written and opens up a new and interesting research direction.

Relation to Prior Work: In summary, I believe the idea presented in the submission is novel and opens up a new and interesting research problem. The method is clearly described and technically sound. It will inspire follow up work on this problem.

Reproducibility: Yes

Additional Feedback: None.


Review 4

Summary and Contributions: The work bridges novel view synthesis (NVS) and semantic image-to-image translation, in particular semantic image synthesis (SIS, mapping semantic segmentation to appearance). Both could simply be done in succession: NVS on the semantic labels followed by SIS, or SIS first, with the result then subjected to NVS. Both orders lead to different kinds of inconsistencies, so the paper advocates doing the two jointly. In particular, it is argued that the semantics should live in 3D (in something like a multi-plane image, MPI), containing semantics rather than appearance features. These are then converted into multiple planes of appearance, which are composed, resulting in consistency. The evaluation compares to the two trivial solutions and some ablations, indicating an improvement in quality, where appearance is put onto the right layer in a peeling-like fashion.
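For reference, the peeling-like composition I understand the method to use is the standard back-to-front "over" operator of MPI rendering. A minimal NumPy sketch, not the authors' code; shapes are assumptions, with the planes already warped into the target view:

    import numpy as np

    def composite_mpi(colors, alphas):
        # colors: (D, H, W, 3) per-plane RGB/feature images, index 0 = farthest.
        # alphas: (D, H, W, 1) per-plane opacities in [0, 1].
        out = np.zeros_like(colors[0])
        for color, alpha in zip(colors, alphas):       # back to front
            out = alpha * color + (1.0 - alpha) * out  # "over" operator
        return out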

Strengths:
-- The quality improves.
-- The approach is logical.
-- The writing explains the problem well and evaluates how the trivial solutions do not work. It is easy to think semantic image synthesis is solved when seeing the fantastic still images or SIS on semantic maps from video games, but it is much less clear how to get truly multi-view SIS as done here.
-- Consistent progress in an important field.

Weaknesses:
-- This is not a totally surprising result.
-- The hoops jumped through because of the dense MPI and the memory limitations indicate there is a need for a more elegant solution.

Correctness: I believe this to be correct. There are so many NVS and SIS operators that it almost cannot be ruled out that some of them would play along more nicely than others. It can also be imagined that other, even simpler ways to solve this could be envisioned.

Clarity: Yes.
-- What is the text in the figure meant to say?
-- Please use proper vector fonts; the figures look aliased, with unreadable text.
-- The zoom in Fig. 1 is pretty meaningless, as the main image has no discernible change, while the insets that are supposed to show an artifact are too small to see it.
-- I did not understand the subtlety about no GAN being allowed in the MPI, etc. I have a rough idea of what the problem could be, but I do not understand the solution. Probably the GAN has to work on a complete image, so the appearance is represented in the MPI as abstract features, not as RGB?
-- Fig. 4 is horrible. Why is it that KITTI etc. is always shown with such a confusing aspect ratio? It is basically impossible to make out anything from those images. If space is of the essence, I would be happier if the image were cut in the middle and only the left half shown, but large and with the proper aspect ratio. Like this, you might as well show nothing. And no, zooming into a PDF is no solution.
-- Fig. 5 is useless, too. It is impossible to make out any differences, let alone imagine what they would mean for view consistency. How about making epipolar slices to have an argument about consistency (see the sketch after this list)? They are not natural images, but at least we would see something.
-- Fig. 6 is even worse. It is too small to be taken seriously.
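To illustrate the epipolar-slice suggestion: render a stack of views along a horizontal camera path and stack one scanline from each; straight streaks in the slice indicate view consistency, while wobbles indicate flicker. A minimal sketch, where the rendered view stack is an assumed input:

    import numpy as np

    def epipolar_slice(views, row):
        # views: (N, H, W, 3) images rendered at N poses along a horizontal path.
        # Returns the (N, W, 3) epipolar-plane image for the given scanline.
        return np.asarray(views)[:, row]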

Relation to Prior Work: Fine

Reproducibility: Yes

Additional Feedback: