NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 2765
Title: Few-shot Video-to-Video Synthesis

Reviewer 1


This is a solid work that extends vid2vid to adapt to unseen domains from a few example images. The key idea is a weight generation module that dynamically generates the weights of the network to handle example images from a new, unseen domain. While dynamic weight generation has been applied to a broad range of problems, this is the first work to apply it to the vid2vid problem. The main downside of this work is the lack of detail on the weight generation module:

1) If I understand correctly, the "style" of the new images is encoded in the beta vectors. The attention module compares the similarity of the poses in the testing examples to the current pose to be translated and produces weights for combining the beta vectors (the new style from an unseen domain); see the sketch below. It seems such an attention scheme is the key to outperforming the alternatives. Please clarify if this is the case.

2) It is not clear to me how many examples (the K parameter) are needed for training and testing. If attention is the key, then the more examples, the better. Does the method fail when K falls below a certain value?

3) It is not clear that the comparisons to the baselines are fair. It seems that all baselines use a single image to encode the "style" code or compute the parameters in AdaIN, whereas the proposed method uses K images. While it is true that the proposed method is unique in being able to fuse multiple reference images using attention, this needs to be discussed clearly in the paper.

4) It is also not clear that weight generation is necessary. Why not simply feed the weighted average appearance representation (Line 154) to the generator, for example, via AdaIN layers? What is really the advantage of E_c here?

Overall, the work addresses a very important problem and provides a novel solution. Although there are still artifacts, the results are convincing in that they show a clear improvement over the existing alternatives. The reason for not giving a higher score is the lack of detail and in-depth discussion of the network weight generation module, especially what actually makes the network work -- the attention or the weight generation.
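For concreteness, here is a minimal PyTorch-style sketch of the attention scheme as I understand it from the paper; the function name, tensor shapes, and variable names are hypothetical, not the authors' implementation:

import torch
import torch.nn.functional as F

def combine_appearance(a_t, e_examples, beta_examples):
    # a_t:           (B, C)    embedding of the current semantic/pose image s_t
    # e_examples:    (B, K, C) embeddings of the K example images
    # beta_examples: (B, K, D) per-example appearance ("beta") vectors
    scores = torch.einsum('bc,bkc->bk', a_t, e_examples)      # similarity of current pose to each example, (B, K)
    alpha = F.softmax(scores, dim=1)                           # attention weights over the K examples
    beta_t = torch.einsum('bk,bkd->bd', alpha, beta_examples)  # weighted combination of the beta vectors, (B, D)
    return beta_t

If this reading is correct, an ablation with K = 1 (where the attention reduces to a no-op) would directly answer whether the attention or the weight generation drives the improvement.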

Reviewer 2


I would have liked to see more analysis of why the proposed "adaptation" approach helps to improve the quality of the results. Compared to the baselines and other methods, the empirical results in this paper are of higher quality both quantitatively and qualitatively. It would have been interesting to show some failure modes and analyze why the failures happened, e.g., by stretching the adaptivity to its limit. I do not find the notation in the paper very clear; it requires a lot of rechecking to figure out what the different letters mean. If the paper is accepted, the authors should consider revising their notation.

Reviewer 3


SUMMARY

The paper tackles a few-shot generalization problem for video-to-video translation, where the goal is to synthesize videos of an unseen domain using a few example images of the target domain. To this end, the authors extend vid2vid with two modifications. First, they replace the image synthesis module with the recently proposed SPADE [41]. Second, they introduce an adaptive weight generation module which provides weight parameters to the SPADE model. This new module is implemented with a soft-attention mechanism that aggregates statistics from a few example images of the target domain. The proposed approach is evaluated on two tasks, motion retargeting and street-scene style transfer. The authors compare with three baselines (implemented by the authors) as well as two existing works [1, 46]. The authors report both quantitative and qualitative results to demonstrate their approach.

ORIGINALITY

- The proposed adaptive weight generation module could be considered new. The idea of using a soft-attention mechanism for dynamic weight generation -- in particular, using the embedding a_t of the current input semantic image s_t as a key vector to retrieve relevant appearance statistics from the example images e_{1:K} -- is intuitive and reasonable. Although attention-based encoding of image sets is not new, it is well used in this work.

- It is unclear why the soft-attention-based weight generation was necessary instead of, e.g., the AdaIN-based approach, as was done in FUNIT [31]; a sketch contrasting the two alternatives appears at the end of this review. This needs to be better motivated: is AdaIN not applicable to the proposed approach, and if so, why? Why is it necessary to propose the attention-based method when existing approaches have already demonstrated the effectiveness of AdaIN in a similar setup (for images)?

QUALITY

- The quantitative results clearly show improvement over the compared baselines. The results on motion retargeting are particularly good, both qualitatively and quantitatively.

- The authors do not discuss limitations/weaknesses of their method. It would have been nice to see a discussion of this.

CLARITY

- The paper reads well overall, but some important details are missing (see my comments below).

SIGNIFICANCE

- As mentioned above, the problem is of wide interest to this community, and the proposed technique generalizes to different domains. The visual quality on motion retargeting is promising, and the code will be released. Given these, I think this work is likely to have a reasonable impact and could encourage follow-ups.
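To make the AdaIN question above concrete, here is a rough sketch of the two conditioning alternatives; the module names and shapes are hypothetical, not code from the paper or FUNIT. The first route uses the aggregated appearance vector to predict per-channel scale/bias for normalized features (AdaIN-style); the second maps the same vector to convolution kernels applied per sample (weight generation).

import torch.nn as nn
import torch.nn.functional as F

class AdaINConditioning(nn.Module):
    # AdaIN-style alternative: beta predicts per-channel scale and bias.
    def __init__(self, beta_dim, num_channels):
        super().__init__()
        self.to_scale_bias = nn.Linear(beta_dim, 2 * num_channels)

    def forward(self, feat, beta):                      # feat: (B, C, H, W), beta: (B, D)
        scale, bias = self.to_scale_bias(beta).chunk(2, dim=1)
        feat = F.instance_norm(feat)                    # normalize, then modulate
        return feat * (1 + scale[:, :, None, None]) + bias[:, :, None, None]

class WeightGeneration(nn.Module):
    # Weight-generation alternative: beta is mapped to the kernel of a conv layer.
    def __init__(self, beta_dim, in_ch, out_ch, k=3):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, k
        self.to_kernel = nn.Linear(beta_dim, out_ch * in_ch * k * k)

    def forward(self, feat, beta):                      # feat: (B, C_in, H, W), beta: (B, D)
        B = feat.size(0)
        kernels = self.to_kernel(beta).view(B * self.out_ch, self.in_ch, self.k, self.k)
        feat = feat.view(1, B * self.in_ch, *feat.shape[2:])
        out = F.conv2d(feat, kernels, padding=self.k // 2, groups=B)  # per-sample kernels via grouped conv
        return out.view(B, self.out_ch, *out.shape[2:])

An ablation replacing the second route with the first, while keeping the attention aggregation fixed, would clarify what the weight generation buys beyond AdaIN-style modulation.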