{"title": "Unsupervised Keypoint Learning for Guiding Class-Conditional Video Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 3814, "page_last": 3824, "abstract": "We propose a deep video prediction model conditioned on a single image and an action class. To generate future frames, we first detect keypoints of a moving object and predict future motion as a sequence of keypoints. The input image is then translated following the predicted keypoints sequence to compose future frames. Detecting the keypoints is central to our algorithm, and our method is trained to detect the keypoints of arbitrary objects in an unsupervised manner.  Moreover, the detected keypoints of the original videos are used as pseudo-labels to learn the motion of objects. Experimental results show that our method is successfully applied to various datasets without the cost of labeling keypoints in videos. The detected keypoints are similar to human-annotated labels, and prediction results are more realistic compared to the previous methods.", "full_text": "Unsupervised Keypoint Learning\n\nfor Guiding Class-Conditional Video Prediction\n\nYunji Kim1, Seonghyeon Nam1, In Cho1, and Seon Joo Kim1,2\n\n{kim_yunji,shnnam,join,seonjookim}@yonsei.ac.kr\n\n1Yonsei University\n\n2Facebook\n\nAbstract\n\nWe propose a deep video prediction model conditioned on a single image and an\naction class. To generate future frames, we \ufb01rst detect keypoints of a moving object\nand predict future motion as a sequence of keypoints. The input image is then\ntranslated following the predicted keypoints sequence to compose future frames.\nDetecting the keypoints is central to our algorithm, and our method is trained to\ndetect the keypoints of arbitrary objects in an unsupervised manner. Moreover,\nthe detected keypoints of the original videos are used as pseudo-labels to learn\nthe motion of objects. 
Experimental results show that our method is successfully\napplied to various datasets without the cost of labeling keypoints in videos. The\ndetected keypoints are similar to human-annotated labels, and prediction results\nare more realistic compared to the previous methods.\n\n1 Introduction\n\nVideo prediction is the task of synthesizing future video frames from a single or few image(s), which\nis challenging due to the uncertainty of the dynamic motions in scenes. Despite its difficulty, this\ntask has attracted great interest in machine learning, as predicting the unknown future is fundamental to\nunderstanding video data and the physical world.\nEarly works in video prediction adopted deterministic models that directly minimize the pixel distance\nbetween the generated frames and ground-truth frames [1\u20134]. Srivastava et al. [1] studied an LSTM-based\nmodel for video prediction and video reconstruction. Finn et al. [2] generate the next frame by\npixel-wise transformation on the previous frame. Kalchbrenner et al. [3] generate a future frame by\ncalculating the distribution of RGB values per pixel given prior frames. De Brabandere et al. [4]\npropose a model that generates dynamic convolutional filters for video and stereo prediction. These\ndeterministic models tend to produce blurry results, and also have a fundamental limitation in that\nthey have difficulty in generating videos for novel scenes that they have not seen before. To overcome\nthese issues, recent approaches take on generative methods based on generative adversarial networks\n(GANs) [5] and variational auto-encoders (VAEs) [6], by using the adversarial loss of GANs and\nthe KL-divergence loss of VAEs as an additional training loss [7\u201310]. Babaeizadeh et al. [7] extend\nthe work of Finn et al. [2] by using the VAE as a backbone structure to generate various samples.\nMathieu et al. [8] propose a GAN-based model to handle blurry results induced by MSE loss. Lee\net al. 
[9] introduced a model combining both GAN and VAE to generate sharp and various results.\nDenton et al. [10] aim to generate various results by learning the conditional distribution of latent\nvariables that drive the next frame with the VAE as a backbone structure.\nThe aforementioned works fall into the black-box approach in Fig. 1 (a), where videos are directly\nsynthesized through spatio-temporal networks. This type of approach achieved limited success on\na few simple datasets which have low variance such as the Moving MNIST [1], KTH human actions\n[11], and BAIR action-free robot pushing dataset [12].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Different types of video prediction algorithms. (a) predicts video using spatio-temporal\nmodels with a black-box approach. (b) utilizes human-annotated keypoints labels and uses them as\nguidance for future frames generation. (c) is our proposed method that internally generates keypoints\nlabels by training the keypoints detector in an unsupervised manner. It also guides future frames with\nthe keypoints sequence.\n\nAs one can imagine, it is more difficult to generate videos than images as we need to represent\nthe temporal domain in addition to the spatial domain. To make the model predict the future with\nan understanding of this nature of videos, some works have attempted to train the model by\ndisentangling the spatial (contents) and the temporal (motion) characteristics of videos [13\u201316].\nTulyakov et al. [13] proposed to generate videos with two random values, each representing the\ncontents and the motion feature of the video. The method of Villegas et al. [14] predicts the next\nframe with a latent motion feature related to multiple previous difference images. To better decompose\nthe two features, the works of [15, 16] impose an adversarial loss on each feature. 
However, the results of\nthese works are similar to the deterministic methods in quality.\nMeanwhile, recent image translation works have shown that using keypoints is a promising approach\n[17\u201320]. In these works, keypoints are used as guidance for the image translation, leading to\nqualitative improvements of the results. This approach was extended to the video prediction task by\n[21\u201323], which fall into Fig. 1 (b). These methods generate future frames by translating a reference\nimage using the keypoints sequence as guidance. The works in [21, 22] are prediction models\nthat utilize labels of human joint positions. Villegas et al. [21] succeeded in generating long-term\nfuture image sequences and improving the visual quality of the results by applying a method called\nvisual-structure analogy making based on the work of Reed et al. [24]. Cai et al. [22] proposed an\nintegrated model that is capable of video generation, prediction, and completion tasks by optimizing\nlatent variables in accordance with given constraints. Wang et al. [23] employed the VAE network\nto generate diverse samples and used a keypoints sequence for synthesizing a human face image\nsequence. These works suggest that using keypoints is effective for the video prediction task. They\nall produce high-quality results for natural scene datasets such as the Penn Action [25] and UCF-101\n[26]. However, these works require frame-by-frame keypoints labeling, which limits the applicability\nof the methods.\nA way to deal with this problem is to employ a keypoints detector trained in an unsupervised manner.\nSeveral models of this kind have recently been proposed [27\u201329]. The method of Thewlis et al. [27]\nlearns to detect keypoints using a known transformation function between two images. Zhang\net al. [28] proposed to find keypoints for image reconstruction and manipulation tasks. 
This model is\nbased on the VAE with the hourglass network [30], and imposes constraints on detected landmarks\nto enhance the validity of the results. Jakab et al. [29] proposed an unsupervised approach to find\nkeypoints of an object that serve as the guidance in the image translation task. The work uses a simple\nmethod called the heatmap bottleneck, showing state-of-the-art keypoints detection performance\nwithout imposing any regularization. This type of keypoints detector was also studied for a video\ngeneration model that implants the motion of a source video to a static object of the target image by\nSiarohin et al. [31], showing successful results on various video datasets.\n\nFigure 2: The overview of our method at inference time. Our method generates future frames through\n3 stages: keypoints detection, motion generation, and keypoints-guided image translation.\n\nBuilding on ideas from previous works, we propose a deep video prediction model that includes a\nkeypoints detector trained in an unsupervised manner, which is illustrated in Fig. 1 (c). Compared\nwith the approaches in Fig. 1 (a) and (b), our model performs better on various datasets including the datasets without\nthe ground-truth keypoints labels, as our method learns the keypoints best suited for the video\nsynthesis without labels. Fig. 2 shows the overview of our method for predicting future frames at\nthe inference time. Our approach consists of 3 stages: keypoints detection, motion generation, and\nkeypoints-guided image translation. In our work, no labels except for the action class are required.\nGiven an input image and a target action class, our method first predicts the keypoints of the input\nimage. Then, our method generates a sequence of keypoints starting from the predicted keypoints,\nwhich follows the given action. 
Finally, the output video is synthesized by translating the input image\nframe-by-frame using the generated keypoints sequence as guidance. The key in our unsupervised\napproach is to use the keypoints of ground-truth videos detected by the keypoints detector as\npseudo-labels for learning the motion generator. Moreover, we propose a robust image translator\nusing the analogical relationship between the image and keypoints, and a background masking to\nsuppress the distraction from noisy backgrounds. Experimental results show that our method produces\nbetter results than previous works, even the ones that utilize human-annotated keypoints labels. The\nperformance of the keypoints detector is greatly improved, allowing our method to be applied to\nvarious datasets.\nThe summary of our contributions is as follows.\n\n\u2022 We propose a deep generative method for class-conditional video prediction from a single\nimage. Our method internally generates keypoints of the foreground object to guide the\nsynthesis of future motion.\n\n\u2022 Our method learns to generate a variety of keypoints sequences from data without labels,\nwhich enables our method to model the motion of arbitrary objects including humans and\nanimals.\n\n\u2022 Our method is robust to the noise of data such as distracting backgrounds, allowing our\nmethod to work robustly on challenging datasets.\n\n2 Method\n\nGiven a source image v_0 \u2208 R^{H\u00d7W\u00d73} at t = 0 with a target action vector a \u2208 R^C, the goal of our task\nis to predict future frames \u02c6v_{1:T} \u2208 R^{T\u00d7H\u00d7W\u00d73} with T > 0, where the motion of a foreground object\nfollows the action code. 
To tackle this problem, our approach is to train a deep generative network,\nconsisting of a keypoints detector, a motion generator, and a keypoints-guided image translator.\nInstead of generating \u02c6v_{1:T} at once, we first predict future motion of the object as a keypoints sequence\n\u02c6k_{1:T} \u2208 R^{T\u00d7K\u00d72} and translate the input frame v_0 with \u02c6k_{1:T} as guidance. The training process\nconsists of two stages: (i) learning the keypoints detector with the image translator and (ii) learning\nthe motion generator. In the following, we describe our network and its training method in detail.\n\nLearning the keypoints detector with the image translator. Fig. 3 shows our method for the\nimage translation employing the keypoints detector and the keypoints-guided image translator.\nInspired by [29], our method learns to detect the keypoints of a foreground object by learning the\nimage translation between two frames (v, v') in the same video. The intuition behind learning\nthe keypoints in this way is that translating v close to v' forces the network to automatically\nfind the most dynamic parts of the image, which can then be used as the guidance to move the\nobject in the reference image. Different from [29], the target image is synthesized by inferring the\nanalogical relationship [21, 24] between the keypoints and the image, where the difference between\nthe reference and the target image (v, v') corresponds to the difference between the two detected\nkeypoints sets, (\u02c6k, \u02c6k').\n\nFigure 3: The overview of training the keypoints detector and the image translator. (a) shows the\nunsupervised learning of the keypoints by learning the image translation. 
(b) shows the detail of our\nbackground masking method.\nThe keypoints detector Q finds K keypoints of the input image. The keypoints coordinates \u02c6k \u2208 R^{K\u00d72}\nare obtained by calculating the expected coordinates of the K-channel soft binary map l \u2208 R^{H\u00d7W\u00d7K},\nwhich is the last feature map of Q followed by a softmax activation described as\n\nl^n_u = e^{Q(v)^n_u} / \u2211_u e^{Q(v)^n_u},    \u02c6k^n = \u2211_u u \u00b7 l^n_u,    (1)\n\nwhere \u02c6k^n is the coordinates of the n-th keypoint and u is the pixel coordinates. The detected keypoints\n\u02c6k are then normalized to have values between -1 and 1, and converted to K Gaussian distribution\nmaps d \u2208 R^{h\u00d7w\u00d7K} using the following formulation:\n\n\u02c6d^n_{u'} = (1 / (\u03c3 \u221a(2\u03c0))) e^{\u2212(u' \u2212 \u02c6k^n)^2 / (2\u03c3^2)},    (2)\n\nwhere \u03c3 is the standard deviation of a Gaussian distribution.\nOur image translation network T only handles dynamic regions by generating an image s \u2208 R^{H\u00d7W\u00d73}\nwith a new appearance of the object and a soft background mask m \u2208 R^{H\u00d7W\u00d71} similar to [32].\nThen, we smoothly blend the input image v and synthesized image s using the background mask m\nwhich is described as\n\nm, s = T(v; \u02c6k; \u02c6k'),    (3)\n\n\u02c6v = m \u2299 v + (1 \u2212 m) \u2299 s,    (4)\n\nwhere \u2299 refers to a Hadamard product of two tensors.\nThe training objective for Q and T consists of a reconstruction loss defined by the distance between\nthe output and target image, and an adversarial loss [5] that leads our model to produce realistic\nimages. 
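As a rough illustration of Eqs. (1), (2), and (4), the following pure-Python sketch computes a soft-argmax keypoint from one detector channel, renders a Gaussian map around it, and blends two images with a mask. It is a sketch under stated assumptions, not the authors' TensorFlow implementation: the function names soft_argmax, gaussian_map, and blend are hypothetical, the default sigma of 0.1 is an arbitrary choice, the squared term in Eq. (2) is read as a squared 2D Euclidean distance, and images are single-channel nested lists with at least two rows and two columns.

```python
import math


def soft_argmax(score_map):
    '''Expected (x, y) position of one detector channel, in the spirit of Eq. (1).

    score_map is an H x W nested list of raw scores (H, W >= 2).
    The result is expressed in normalized coordinates in [-1, 1].
    '''
    height, width = len(score_map), len(score_map[0])
    peak = max(max(row) for row in score_map)  # shift scores for numerical stability
    weights = [[math.exp(s - peak) for s in row] for row in score_map]
    total = sum(sum(row) for row in weights)
    x = y = 0.0
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            p = w / total  # spatial softmax probability at pixel (i, j)
            x += p * (2.0 * j / (width - 1) - 1.0)
            y += p * (2.0 * i / (height - 1) - 1.0)
    return x, y


def gaussian_map(keypoint, height, width, sigma=0.1):
    '''Gaussian map centred on a normalized keypoint, in the spirit of Eq. (2).

    Assumption: the squared term is the squared 2D distance to the keypoint.
    '''
    kx, ky = keypoint
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    rows = []
    for i in range(height):
        uy = 2.0 * i / (height - 1) - 1.0
        row = []
        for j in range(width):
            ux = 2.0 * j / (width - 1) - 1.0
            sq_dist = (ux - kx) ** 2 + (uy - ky) ** 2
            row.append(norm * math.exp(-sq_dist / (2.0 * sigma ** 2)))
        rows.append(row)
    return rows


def blend(mask, reference, synthesized):
    '''Mask-weighted blend of Eq. (4): mask * reference + (1 - mask) * synthesized.'''
    return [[m * v + (1.0 - m) * s for m, v, s in zip(mr, vr, sr)]
            for mr, vr, sr in zip(mask, reference, synthesized)]
```

A score map with one dominant peak yields a keypoint close to that pixel, while a flat map yields the image centre. Because the keypoint is an expectation over softmax weights, it stays differentiable with respect to the scores, which is what allows a detector of this kind to be trained through the translation loss.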
We use the perceptual loss [33] based on the VGG-19 network [34] pretrained for the image\nrecognition task [35] as the reconstruction loss to enforce perceptual similarity of the generated image\nand the target image.\nHence, our optimization is to alternately minimize the two losses defined as follows:\n\nL_{D_im} = \u2212 log D_im(v') \u2212 log(1 \u2212 D_im(\u02c6v)),\n\nL_{Q,T} = \u2212 log D_im(\u02c6v) + \u03bb_1 E_l \u2016\u03a6_l(\u02c6v) \u2212 \u03a6_l(v')\u2016,\n\nwhere D_im is the image discriminator, \u03a6_l is the l-th layer of the VGG-19 network, and \u03bb_1 is the\nweight of the perceptual loss.\n\nLearning the motion generator with pseudo-labeled data. Our method for the motion generation\nis shown in Fig. 4. After completing the first training stage, we can detect keypoints of any image\nand translate an image with arbitrary target keypoints. 
With the trained keypoints detector, we prepare\npseudo-labels \u02c6k_{0:T} by detecting keypoints of real videos and use them to train our motion generator\nM to generate sequences of future keypoints, which are used as guidance for synthesizing future\nframes \u02c6v_{1:T}.\nWe build our motion generator upon a conditional variational auto-encoder (cVAE) [36] to learn the\ndistribution of future events with the given conditions. Specifically, our motion generator learns to\nencode the pseudo-labels to normally distributed latent variables q_\u03c6(z|\u02c6k_{1:T}, \u02c6k_0, a), and to decode z\nback to the corresponding keypoints sequence p_\u03b8(\u02c6k_{1:T}|z, \u02c6k_0, a). To handle the sequential data, an\nLSTM network [37] was used for both the encoder and the decoder. At inference time, the future\nmotion is predicted by random sampling of the z value from N(0, I), which leads M to generate many\npossible results.\nThe network is trained by optimizing the variational lower bound [6] that is comprised of the KL-divergence\nand the reconstruction loss. We additionally trained the keypoints sequence discriminator\nD_seq, as we have found that using an adversarial loss [5] on the cVAE model improves the quality of\nthe results.\nOur training of M is to alternately minimize the following two objectives:\n\nL_{D_seq} = \u2212 log D_seq(\u02c6k_{1:T}) \u2212 log(1 \u2212 D_seq(\u02dck_{1:T})),\n\nL_M = D_KL(q_\u03c6(z|\u02c6k_{1:T}, \u02c6k_0, a) \u2016 p_z(z)) + \u03bb_2 \u2016\u02dck_{1:T} \u2212 \u02c6k_{1:T}\u2016_1 \u2212 \u03bb_3 log D_seq(\u02dck_{1:T}),    (5)\n\nwhere \u02dck is the reconstruction of keypoints, and \u03bb_2 and \u03bb_3 are hyperparameters. The prior distribution\nof latent variables p_z(z) is set as N(0, I).\n\nFigure 4: The overview of training the motion generator. Our motion generator is built upon the cVAE\nframework conditioned on the initial keypoints and the action class. We utilize detected keypoints\nsequence of real videos to learn the motion of arbitrary objects.\n\n3 Experiments\n\n3.1 Datasets\n\nPenn Action This dataset [25] consists of videos of humans performing sports actions. The total number of\nvideos is 2326, and the number of action classes is 15. We only used videos that show the whole body\nof a foreground actor and excluded classes with too few samples. Hence, out of 15 action classes,\nwe only used 9 classes: baseball pitch, clean and jerk, pull ups, baseball swing, golf swing, tennis\nforehand, jumping jacks, tennis serve, and squats. 
Due to the lack of data, only 10 samples per\nclass were used as the test set and the rest as the training set, making sure that there are no\noverlapping scenes in the training and test sets. The final dataset consists of 1172 training videos\nand 90 test videos. During the training process, data was heavily augmented by random horizontal\nflipping, random rotation, random image filter, and random cropping.\n\nUvA-NEMO and MGIF The UvA-NEMO dataset [38] consists of 1234 videos of smiling human faces,\nwhich is split into 1110 videos for the training set and 124 for the evaluation set. The MGIF dataset [31]\nconsists of videos of cartoon animal characters simply walking or running on a white-colored\nbackground. For this dataset, 900 videos are used for the training and 100 videos are used for\nthe evaluation. We used the pre-processed version of both datasets provided by Siarohin et al. [31],\nand applied the same augmentation methods used for the Penn Action dataset.\n\n3.2 Implementation details\n\nThe resolution of both the input and the output images is 128\u00d7128, and the number of keypoints K\nwas set to 40, 15, and 60 for the Penn Action, UvA-NEMO, and MGIF dataset, respectively. 
We\nimplemented our method using TensorFlow with the Adam optimizer [41], a learning rate of 0.0001,\na batch size of 32, and the two momentum values of 0.5 and 0.999. We decayed the learning rate\nby a factor of 0.95 every 20,000 iterations. The keypoints detector and the translator were optimized until the\nconvergence of the perceptual loss, and the motion generator until the KL-divergence convergence.\nConsidering the tendency of the convergence, \u03bb_1, \u03bb_2, and \u03bb_3 were set to 1, 1000, and 2, respectively.\nSince the UvA-NEMO and MGIF datasets consist of videos of the same action, only initial keypoints k_0\nare set as the condition for the motion generation.^1\n\nTable 1: Fr\u00e9chet Video Distance (FVD) [40] of generated videos. On every dataset, our method\nachieved the best score. (The lower is better.)\n\nDataset | [39] | [16] | [21] | Ours\nPenn Action [25] | 4083.3 | 3324.9 | 2187.5 | 1509.0\nUvA-NEMO [38] | 666.9 | 265.2 | - | 162.4\nMGIF [31] | 683.1 | 1079.6 | - | 409.1\n\n3.3 Baselines\n\nWe compare our method with three baselines [16, 39, 21], all of which produce future frames in\ntwo stages, predicting the guiding information first. The method of Wichers et al. [16] generates\nframes with the latent motion feature, which is learned with an adversarial network. The works of\nVillegas et al. [21] and Li et al. [39] guide frame generation with keypoints and optical\nflow, respectively, utilizing keypoints labels and a pretrained optical flow predictor. Only the work of\nLi et al. [39] is conditioned on a single frame like ours, while the others [16, 21] are conditioned on\nmultiple prior frames. For implementing the baseline models, we used the code released by the\nauthors, maintaining the original settings of each model, including the number of conditional images,\nimage resolution, and the number of future frames.\n\n3.4 Results\n\nQualitative results Video prediction results of our method on the Penn Action dataset are shown in\nFig. 5 (a). The generated videos present both realistic images per frame and plausible motion\ncorresponding to the target action class. The synthesized image and mask sequence imply that\nour model disentangles dynamic regions well, and the predicted keypoints sequence is similar to\nhuman-annotated labels. A comparison of the results is shown in Fig. 5 (b). 
Since the number of\ngenerated frames varies from model to model, we sampled 8 frames from each result that represent\nthe whole sequence for the qualitative comparison. The results imply that our method achieved\nimprovements in both the visual and the dynamics quality compared to the baselines. The work of\nWichers et al. [16] failed to make realistic and dynamic future frames, although it is capable of\ndistinguishing moving objects to some extent. The method of Li et al. [39] struggles with error propagation\nsince it applies the warp operation with the predicted optical flow sequence. The results of Villegas\net al. [21] are the best among the baselines, showing plausible and dynamic motion. However, the\nresults of our model are more visually realistic, as it employs the keypoints specifically learned for\nthe image synthesis. Moreover, our model can generate various results with only one image as shown\nin Fig. 6, by randomly sampling the z value and changing the target action class.\nFig. 7 shows our prediction results on the UvA-NEMO and the MGIF datasets. The work of [21] was\nnot compared since these datasets have no keypoints labels. The results show that the work of Li\net al. [39] still has the error propagation issue on both datasets due to the warping operation. The\nmethod of Wichers et al. [16] failed on the MGIF dataset, but succeeded at generating plausible\nfuture frames on the UvA-NEMO dataset. Meanwhile, our method generates frames with dynamic\nmotion while maintaining the visual attributes of the foreground object over the whole sequence.\nIn addition to the video prediction results, we also demonstrate the performance of our image\ntranslator in Fig. 8. Examples are chosen to show different capabilities of our image translator:\n(a) translation, (b) inpainting, and (c) object removal. The result in (a) suggests that the reference\nimage is well translated by detected keypoints. 
The synthesized mask and the image imply that\nour model focuses on filling in occluded or disoccluded regions, separating the foreground region\nsharply. Interestingly, the results in (b) and (c) show that our model learned additional abilities to\nfill or remove parts of the image, when there are no corresponding objects. All these imply that our\nmodel has learned the robust ability to discern moving objects as keypoints.\n\n^1 The architectural details of our model are described in the supplementary material.\n\nFigure 5: Video prediction results on the Penn Action dataset. In (a), the input image and the target\naction are shown on the left side. On the right side, the ground-truth video, synthesized video after\nmasking, synthesized video before masking, background mask, and keypoints are shown from the\ntop. (b) compares the result of ours with the baseline methods.\n\nFigure 6: Variety of prediction results. The examples in (a) are induced by the random sampling of z\nvalue in the motion generator, and (b) by the change of the target action class a.\n\nFigure 7: Video prediction results on UvA-NEMO and MGIF datasets.\n\nFigure 8: Image translation results. Columns (i)-(vii) represent\nthe following in order: reference image, target image,\ndetected keypoints of reference/target images, background\nmask, synthesized image, and final translation result. 
The\nsamples in rows (a)-(c) show that our image translator is\ncapable of different tasks including translation, inpainting,\nand object removal.\n\nTable 2: Action recognition accuracy.\n\nMethod | Accuracy\nOurs | 68.89\nOurs w/o a | 63.33\n[21] | 47.14\n[16] | 40.00\n[39] | 15.55\n\nTable 3: Quantitative result of the\nuser study. The values refer to average rankings.\n\nMethod | Ranking\nOurs | 1.81 \u00b1 1.02\n[21] | 2.44 \u00b1 0.98\n[16] | 3.14 \u00b1 1.09\n[39] | 2.61 \u00b1 0.96\n\nQuantitative results For quantitative comparison, we used the Fr\u00e9chet video distance (FVD) [40].\nThis is the Fr\u00e9chet distance between the feature representations of real and generated videos. The feature\nrepresentations were obtained from the I3D model [42] trained on Kinetics-400 [43]. The results are\nreported in Table 1. Our method achieved the smallest FVD values on every dataset. This implies\nthat our method generates more realistic videos compared to the baseline methods.\nIn addition, we assess the plausibility of generated motion. Since it is obvious what action the object\nwould take from the conditional image(s), we compared the action recognition accuracy on the\nresults using the two-stream CNN [44] that is fine-tuned on the Penn Action dataset. We additionally\ncompared the results of our method without the conditional term for the target action class for fair\ncomparison. Our method achieved the best recognition score as shown in Table 2. Even though\nremoving the action class condition slightly affects the performance, the gap is small compared to\nthe baseline results. 
This implies that our method does a better job of generating plausible motion\ncompared to the other baseline approaches.\nWe also conducted a user study on Amazon Mechanical Turk (AMT), since the above metrics cannot\nfully reflect human perception of the visual quality of the results. We compared 70 of the 90\nprediction results on the Penn Action dataset, since [21] was trained for a different set of action classes.\nFifteen workers were asked to rank the results generated by different methods based on the visual quality\nand the degree of the movement in the foreground region. During the process, workers were shown four\nvideos side by side, where the order of videos was randomly chosen for each vote. The averaged\nrankings for all methods are shown in Table 3, which indicates that our method outperforms all the\nbaselines in both aspects, even though it has been trained without any labels.\n\nFigure 9: The results of the keypoints-guided\nimage translation from (a) the baseline method\n[29], (b) our network without the mask, and\n(c) our network. Our method achieved performance\nimprovement in both the keypoints detection\nand the image translation compared to\nthe baseline method.\n\nFigure 10: Failure cases.\n\nComponent analysis Our keypoints-guided image translation method achieved improvement in\nperformance compared to the original work [29] by (i) learning the analogical relationship between\nthe keypoints and the image and (ii) generating a background mask. We analyzed the effect of each\ncomponent as shown in Fig. 9.\nComparing the results in Fig. 9 (a) and (b), the performance of the keypoints detector has improved\nwhen the process employs the reference keypoints in addition to the target keypoints. 
With the keypoints\nof both images and the reference image, the translator can synthesize the foreground object in the\ntarget pose by inferring the analogical relationship like \"A is to B as C is to what?\". If the reference\nkeypoints are not considered, the translator would have to find the region to translate independently,\nwhich is a redundant and inefficient set-up. The result in Fig. 9 (c) shows that incorporating the background\nmask generation into the keypoints-guided image translation led to significant improvement\nin quality of the translated image. The mask generation is effective when only a specific part of\nthe image needs to be translated, since synthesizing only the foreground object is beneficial for the\nnetwork to fool the image discriminator by reducing the complexity of the modeling compared to\nsynthesizing the entire scene. Achieving these improvements in the keypoints detection and the image\ntranslation, our method could be applied to various datasets successfully.\n\nFailure Cases We found two cases in which our model failed to generate plausible future frames.\nThe first case is from the failure of the keypoints detector when there are multiple objects with similar\nsize in the input image (first example in Fig. 10). This failure causes a series of failures in the motion\ngeneration and the image translation. The second example in Fig. 10 shows the other failure case.\nSince our keypoints detector works in a body orientation agnostic way, the object moves in the opposite\ndirection from our expectations in some cases.\n\n4 Conclusion\n\nIn this paper, we proposed an action-conditioned video prediction method using a single image.\nInstead of generating future frames at once, we first predict the temporal propagation of the foreground\nobject as a sequence of keypoints. Following the motion of the keypoints, the input image is translated\nto compose future frames. 
Our network is trained in an unsupervised manner, using the detected\nkeypoints of the original videos as pseudo-labels to train the motion generator. Experimental results\nshow that our method achieves a significant improvement in the visual quality of the results and is\nsuccessfully applied to various datasets without using any labels.\n\nAcknowledgement This work was supported by Samsung Research Funding Center of Samsung Electronics\nunder Project Number SRFC-IT1701-01.\n\nReferences\n[1] N. Srivastava, E. Mansimov, and R. Salakhutdinov, \u201cUnsupervised learning of video representations using LSTMs,\u201d in ICML, 2015.\n\n[2] C. Finn, I. Goodfellow, and S. Levine, \u201cUnsupervised learning for physical interaction through video prediction,\u201d in NeurIPS, 2016.\n\n[3] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu, \u201cVideo pixel networks,\u201d in ICML, 2017.\n\n[4] B. de Brabandere, X. Jia, T. Tuytelaars, and L. van Gool, \u201cDynamic filter networks,\u201d in NeurIPS, 2016.\n\n[5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, \u201cGenerative adversarial nets,\u201d in NeurIPS, 2014.\n\n[6] D. Kingma and M. Welling, \u201cAuto-encoding variational Bayes,\u201d in ICLR, 2014.\n\n[7] M. Babaeizadeh, C. Finn, D. Erhan, R. Campbell, and S. Levine, \u201cStochastic variational video prediction,\u201d in ICLR, 2018.\n\n[8] M. Mathieu, C. Couprie, and Y. LeCun, \u201cDeep multi-scale video prediction beyond mean square error,\u201d in ICLR, 2016.\n\n[9] A. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine, \u201cStochastic adversarial video prediction,\u201d in arXiv:1804.01523, 2018.\n\n[10] E. Denton and R. Fergus, \u201cStochastic video generation with a learned prior,\u201d in ICML, 2018.\n\n[11] C. Schuldt, I. Laptev, and B. 
Caputo, \u201cRecognizing human actions: a local SVM approach,\u201d in ICPR, 2004.\n\n[12] F. Ebert, C. Finn, A. Lee, and S. Levine, \u201cSelf-supervised visual planning with temporal skip connections,\u201d in CoRL, 2017.\n\n[13] S. Tulyakov, M. Liu, X. Yang, and J. Kautz, \u201cMoCoGAN: Decomposing motion and content for video generation,\u201d in CVPR, 2018.\n\n[14] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee, \u201cDecomposing motion and content for natural video sequence prediction,\u201d in ICLR, 2017.\n\n[15] E. Denton and V. Birodkar, \u201cUnsupervised learning of disentangled representations from video,\u201d in NeurIPS, 2017.\n\n[16] N. Wichers, R. Villegas, D. Erhan, and H. Lee, \u201cHierarchical long-term video prediction without supervision,\u201d in ICML, 2018.\n\n[17] G. Balakrishnan, A. Zhao, A. Dalca, F. Durand, and J. Guttag, \u201cSynthesizing images of humans in unseen poses,\u201d in CVPR, 2018.\n\n[18] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. van Gool, \u201cPose guided person image generation,\u201d in NeurIPS, 2017.\n\n[19] S. Reed, A. van den Oord, N. Kalchbrenner, S. Colmenarejo, Z. Wang, Y. Chen, D. Belov, and N. de Freitas, \u201cParallel multiscale autoregressive density estimation,\u201d in ICML, 2017.\n\n[20] C. Chan, S. Ginosar, T. Zhou, and A. Efros, \u201cEverybody dance now,\u201d in arXiv:1808.07371, 2018.\n\n[21] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee, \u201cLearning to generate long-term future via hierarchical prediction,\u201d in ICML, 2017.\n\n[22] H. Cai, C. Bai, Y. Tai, and C. Tang, \u201cDeep video generation, prediction and completion of human action sequences,\u201d in ECCV, 2018.\n\n[23] W. Wei, X. Alameda-Pineda, D. Xu, P. Fua, E. Ricci, and N. Sebe, \u201cEvery smile is unique: Landmark-guided diverse smile generation,\u201d in CVPR, 2018.\n\n[24] S. Reed, Y. Zhang, Y. Zhang, and H. 
Lee, \u201cDeep visual analogy-making,\u201d in NeurIPS, 2015.\n\n[25] W. Zhang, M. Zhu, and K. Derpanis, \u201cFrom actemes to action: A strongly-supervised representation for detailed action understanding,\u201d in ICCV, 2013.\n\n[26] K. Soomro, A. Zamir, and M. Shah, \u201cUCF101: A dataset of 101 human actions classes from videos in the wild,\u201d in CoRR, 2012.\n\n[27] J. Thewlis, H. Bilen, and A. Vedaldi, \u201cUnsupervised learning of object landmarks by factorized spatial embeddings,\u201d in ICCV, 2017.\n\n[28] Y. Zhang, Y. Guo, Y. Jin, Y. Luo, Z. He, and H. Lee, \u201cUnsupervised discovery of object landmarks as structural representations,\u201d in CVPR, 2018.\n\n[29] T. Jakab, A. Gupta, H. Bilen, and A. Vedaldi, \u201cUnsupervised learning of object landmarks through conditional image generation,\u201d in NeurIPS, 2018.\n\n[30] A. Newell, K. Yang, and J. Deng, \u201cStacked hourglass networks for human pose estimation,\u201d in ECCV, 2016.\n\n[31] A. Siarohin, S. Lathuili\u00e8re, S. Tulyakov, E. Ricci, and N. Sebe, \u201cAnimating arbitrary objects via deep motion transfer,\u201d in CVPR, 2019.\n\n[32] Y. A. Mejjati, C. Richardt, J. Tompkin, D. Cosker, and K. I. Kim, \u201cUnsupervised attention-guided image-to-image translation,\u201d in NeurIPS, 2018.\n\n[33] J. Johnson, A. Alahi, and L. Fei-Fei, \u201cPerceptual losses for real-time style transfer and super-resolution,\u201d in ECCV, 2016.\n\n[34] K. Simonyan and A. Zisserman, \u201cVery deep convolutional networks for large-scale image recognition,\u201d in arXiv:1409.1556, 2014.\n\n[35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei, \u201cImageNet large scale visual recognition challenge,\u201d in IJCV, 2015.\n\n[36] K. Sohn, X. Yan, and H. 
Lee, \u201cLearning structured output representation using deep conditional generative models,\u201d in NeurIPS, 2015.\n\n[37] S. Hochreiter and J. Schmidhuber, \u201cLong short-term memory,\u201d in Neural Comput., 1997.\n\n[38] H. Dibeklio\u011flu, A. Salah, and T. Gevers, \u201cAre you really smiling at me? spontaneous versus posed enjoyment smiles,\u201d in ECCV, 2012.\n\n[39] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M. Yang, \u201cFlow-grounded spatial-temporal video prediction from still images,\u201d in ECCV, 2018.\n\n[40] T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly, \u201cTowards accurate generative models of video: A new metric & challenges,\u201d in arXiv:1812.01717, 2018.\n\n[41] D. P. Kingma and J. Ba, \u201cAdam: A method for stochastic optimization,\u201d in arXiv:1412.6980, 2014.\n\n[42] J. Carreira and A. Zisserman, \u201cQuo vadis, action recognition? a new model and the kinetics dataset,\u201d in CVPR, 2017.\n\n[43] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman, \u201cThe kinetics human action video dataset,\u201d in arXiv:1705.06950, 2017.\n\n[44] K. Simonyan and A. Zisserman, \u201cTwo-stream convolutional networks for action recognition in videos,\u201d in NeurIPS, 2014.\n", "award": [], "sourceid": 2092, "authors": [{"given_name": "Yunji", "family_name": "Kim", "institution": "Yonsei University"}, {"given_name": "Seonghyeon", "family_name": "Nam", "institution": "Yonsei University"}, {"given_name": "In", "family_name": "Cho", "institution": "Yonsei University"}, {"given_name": "Seon Joo", "family_name": "Kim", "institution": "Yonsei University / Facebook"}]}