NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 2092
Title: Unsupervised Keypoint Learning for Guiding Class-Conditional Video Prediction

Reviewer 1

Originality: The proposed method extends the work of [20] to generate action-conditioned videos. The authors modify the landmark discovery pipeline to also generate foreground masks for the moving object in the video. Once the landmark discovery network is trained, the generated landmarks are used as supervision to train a stochastic sequence generation model. To the best of my knowledge, this is the first model to achieve this without the use of landmark labels.

Quality: The writing needs work. There are many grammatical errors throughout the paper (e.g., line 199: "detected keypoint are" -> "detected keypoints are"). I recommend the authors fix these in the next revision. The illustrations of results are good and showcase the quality of the generated videos, and the supplementary material is very helpful for judging the spatio-temporal quality of the generated videos. I would, however, have liked to see side-by-side video comparisons with the baselines.

Clarity: The method explanation is clear; however, the explanation of the Mechanical Turk evaluation setup could be improved.

Significance: The quantitative and qualitative results, given the unsupervised nature of the method, make the video generation results highly significant. In addition, the applications to video inpainting, object removal, and pose translation may be of interest to researchers in image/video editing.

Reviewer 2

Pros:
• The paper is well written and easy to follow.
• The proposed approach uses only a single reference image to produce long-range predictions into the future.
• The model is object-class agnostic, in the sense that it does not use the class information of the foreground object to construct the keypoint sequences.
• The method is robust to moving backgrounds, as claimed by the authors and supported by a few qualitative results.
• The visual results are better than those of the compared approaches.
• Keypoint learning without any external labels appears to be a novel addition.

Cons/Questions:
• Even though the authors claim the method is robust to moving backgrounds, what happens if the background contains motion from similar objects but in different directions? For example, consider a street scene where the background contains several cars and/or pedestrians moving in all directions.
• Is the method also applicable to scenes with multiple foreground objects? These objects could belong to the same class or to different classes.
• In lines 110-115, the method hallucinates the future foreground based on the single given reference image; the background information is also carried forward from this reference input. How does the model handle situations where both the foreground and the background are moving (for example, a sports scene such as skating, or a moving camera)? In these cases, the long-term future background will not resemble the one in the reference input.
• The use of a small pool of AMT human workers to rank the models is not reproducible for future work. Even though this is an unbiased ranking, a quantitative study with traditional metrics such as PSNR/SSIM would help other models gauge their performance against the proposed one. What is the rank of the ground truth in such a human evaluation? That could have served as a solid reference/benchmark for comparison on a relative scale.
• Is the model end-to-end trainable?
• Failure cases are not discussed. Also, strangely, the sample videos in the supplementary document have identical aliasing artifacts/degradation uniformly across all frames. This is inconsistent with the increasing degradation over time that is evident in the predicted frames shown in the paper and the supplementary document. Another noticeable fact is that the prediction is not in sync with the ground truth as far as the pose of the moving object is concerned. Is this a result of disentangling motion and content? There has been a lot of recent work on video frame prediction, and many approaches use the idea of disentangling motion and content. It would also be better to compare model performance using auxiliary metrics such as action recognition accuracy on the generated output. Also, what about cross-dataset evaluation? In Sec. 3.2, it is not clear how the system was trained on the NEMO and MGIF data. Finally, only standard adversarial loss functions are used, which does not add value to the CNN or GAN literature; recent papers on video prediction have used other forms of objective functions, so why not try a few variants?
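The PSNR baseline suggested above is straightforward to report. A minimal NumPy sketch (function names and the per-frame averaging convention are illustrative choices, not taken from the paper under review):

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

def video_psnr(pred_video, gt_video, max_val=255.0):
    """Average per-frame PSNR over a (T, H, W, C) video pair."""
    return float(np.mean([psnr(p, g, max_val) for p, g in zip(pred_video, gt_video)]))
```

SSIM involves local windowed statistics and is typically taken from an existing implementation rather than written by hand.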

Reviewer 3

SUMMARY
This paper proposes an action-conditional video prediction approach based on unsupervised keypoint detection. Given an input image and an action category label, the approach takes three steps to generate an output video: it (1) detects keypoints in the input image; (2) generates a sequence of future keypoints for the given action category; and (3) translates each keypoint frame into the target RGB image. Each step uses a dedicated neural network. The authors design a two-stage training strategy: first, the keypoint detector and the translator are trained on pairs of RGB images (in an unsupervised manner); the motion generator is then trained on video samples while the other two networks are kept fixed. The authors demonstrate their approach on three datasets, Penn Action, UvA-NEMO, and MGIF, comparing with three recently proposed approaches [12, 17, 19].

ORIGINALITY
The proposed approach eliminates the need for keypoint-labeled datasets through unsupervised learning. This makes it generalizable to a variety of scenarios where ground-truth keypoints are not easy to obtain. This is conceptually attractive and will likely encourage more research in this direction. The authors make two modifications to [20] for their keypoint detection network: (1) instead of using keypoints of the target frame only, they use keypoints of both the input and target images; (2) instead of predicting the target RGB image directly, they predict a foreground image and a mask image, and then blend them with the input image to produce the target image. While the modifications are reasonable, it is unclear whether they actually improve the quality of the keypoint detection results compared to [20]. Also, mask-based image generation is not entirely new; several existing approaches already adopt the same idea, e.g., [2] and vid2vid [Wang et al., 2018]. Overall, the proposed approach is a combination of well-known techniques applied to a novel scenario. The technical novelty seems somewhat incremental.
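The mask-based generation described in modification (2) amounts to a per-pixel convex combination of the predicted foreground and the reference frame. A minimal NumPy sketch (function and variable names are mine, not the authors'):

```python
import numpy as np

def blend(foreground, mask, reference):
    """Alpha-blend a predicted foreground into the reference frame.

    foreground, reference: (H, W, 3) float arrays in [0, 1]
    mask: (H, W, 1) soft foreground mask in [0, 1]
    """
    return mask * foreground + (1.0 - mask) * reference

# Toy check: where the mask is 1 we keep the foreground,
# where it is 0 we keep the reference background.
fg = np.full((4, 4, 3), 0.8)
ref = np.zeros((4, 4, 3))
m = np.zeros((4, 4, 1))
m[:2] = 1.0
out = blend(fg, m, ref)
```

With a soft (non-binary) mask, the same formula produces smooth transitions at object boundaries, which is what makes it attractive for generating the moving foreground while reusing the static background.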
QUALITY
The design of the proposed approach is reasonable. However, it is unclear why the modifications to [20] were necessary; the authors do not show ablation results to justify them. The approach is demonstrated only qualitatively, using a limited number of examples and a human evaluation. While the examples provided in the main paper and the supplementary look promising, it would have been far more convincing if the authors had provided more examples of different scenarios, e.g., providing the same input image with different class labels, or varying the magnitude of the random vector z to show the network generating different videos. In its current form, the results are not that convincing. The UvA-NEMO dataset contains videos of smiling people with subtle differences, e.g., spontaneous vs. posed. It is therefore difficult to see whether the results actually model the correct action class. Why not use other datasets that provide facial expressions with more dramatic differences across classes? For example, the MUG dataset contains videos from six categories of emotion (anger, disgust, fear, happiness, sadness, surprise).

CLARITY
The paper reads well overall, although there are a few typos and grammatical errors. I think Figures 9 and 10 are not very informative. It would have been better to move them to the supplementary material (with accompanying demo videos) and instead show results from an ablation study.

SIGNIFICANCE
Keypoint-based video prediction has become popular in the literature. This work contributes by showing the potential of unsupervised keypoint learning, eliminating the need for keypoint-labeled datasets. I like the direction of this paper, and the proposed approach sounds reasonable. However, I have some concerns about the insufficient experimental results.