{"title": "Generating Videos with Scene Dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 613, "page_last": 621, "abstract": "We capitalize on large amounts of unlabeled video in order to learn a model of scene dynamics for both video recognition tasks (e.g. action classification) and video generation tasks (e.g. future prediction). We propose a generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background. Experiments suggest this model can generate tiny videos up to a second at full frame rate better than simple baselines, and we show its utility at predicting plausible futures of static images. Moreover, experiments and visualizations show the model internally learns useful features for recognizing actions with minimal supervision, suggesting scene dynamics are a promising signal for representation learning. We believe generative video models can impact many applications in video understanding and simulation.", "full_text": "Generating Videos with Scene Dynamics\n\nCarl Vondrick\n\nMIT\n\nHamed Pirsiavash\n\nUMBC\n\nAntonio Torralba\n\nMIT\n\nvondrick@mit.edu\n\nhpirsiav@umbc.edu\n\ntorralba@mit.edu\n\nAbstract\n\nWe capitalize on large amounts of unlabeled video in order to learn a model of\nscene dynamics for both video recognition tasks (e.g. action classi\ufb01cation) and\nvideo generation tasks (e.g. future prediction). We propose a generative adversarial\nnetwork for video with a spatio-temporal convolutional architecture that untangles\nthe scene\u2019s foreground from the background. Experiments suggest this model can\ngenerate tiny videos up to a second at full frame rate better than simple baselines,\nand we show its utility at predicting plausible futures of static images. 
Moreover, experiments and visualizations show the model internally learns useful features for recognizing actions with minimal supervision, suggesting scene dynamics are a promising signal for representation learning. We believe generative video models can impact many applications in video understanding and simulation.

1 Introduction

Understanding object motions and scene dynamics is a core problem in computer vision. For both video recognition tasks (e.g., action classification) and video generation tasks (e.g., future prediction), a model of how scenes transform is needed. However, creating a model of dynamics is challenging because there is a vast number of ways that objects and scenes can change.

In this work, we are interested in the fundamental problem of learning how scenes transform with time. We believe investigating this question may yield insight into the design of predictive models for computer vision. However, since annotating this knowledge is both expensive and ambiguous, we instead seek to learn it directly from large amounts of in-the-wild, unlabeled video. Unlabeled video has the advantage that it can be economically acquired at massive scales yet contains rich temporal signals "for free" because frames are temporally coherent.

With the goal of capturing some of the temporal knowledge contained in large amounts of unlabeled video, we present an approach that learns to generate tiny videos which have fairly realistic dynamics and motions. To do this, we capitalize on recent advances in generative adversarial networks [9, 31, 4], which we extend to video. We introduce a two-stream generative model that explicitly models the foreground separately from the background, which allows us to enforce that the background is stationary, helping the network to learn which objects move and which do not.

Our experiments suggest that our model has started to learn about dynamics.
In our generation experiments, we show that our model can generate scenes with plausible motions.1 We conducted a psychophysical study where we asked over a hundred people to compare generated videos, and people preferred videos from our full model more often. Furthermore, by making the model conditional on an input image, our model can sometimes predict a plausible (but "incorrect") future. In our recognition experiments, we show how our model has learned, without supervision, useful features for human action classification. Moreover, visualizations of the learned representation suggest future generation may be a promising supervisory signal for learning to recognize objects of motion.

1See http://mit.edu/vondrick/tinyvideo for the animated videos.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

The primary contribution of this paper is showing how to leverage large amounts of unlabeled video in order to acquire priors about scene dynamics. The secondary contribution is the development of a generative model for video. The remainder of this paper describes these contributions in detail. In section 2, we describe our generative model for video. In section 3, we present several experiments to analyze the generative model. We believe that generative video models can impact many applications, such as in simulations, forecasting, and representation learning.

1.1 Related Work

This paper builds upon early work in generative video models [29]. However, previous work has focused mostly on small patches, and evaluated it for video clustering. Here, we develop a generative video model for natural scenes using state-of-the-art adversarial learning methods [9, 31]. Conceptually, our work is related to studies into fundamental roles of time in computer vision [30, 12, 2, 7, 24].
However, here we are interested in generating short videos with realistic temporal semantics, rather than detecting or retrieving them.

Our technical approach builds on recent work in generative adversarial networks for image modeling [9, 31, 4, 47, 28], which we extend to video. To our knowledge, there has been relatively little work extensively studying generative adversarial networks for video. Most notably, [22] also uses adversarial networks for video frame prediction. Our framework can generate videos for longer time scales and learn representations of video using unlabeled data. Our work is also related to efforts to predict the future in video [33, 22, 43, 50, 42, 17, 8, 54] as well as concurrent work in future generation [6, 15, 20, 49, 55]. Often these works may be viewed as a generative model conditioned on the past frames. Our work complements these efforts in two ways. Firstly, we explore how to generate videos from scratch (not conditioned on the past). Secondly, while prior work has used generative models in video settings mostly on a single frame, we jointly generate a sequence of frames (32 frames) using spatio-temporal convolutional networks, which may help prevent drift due to accumulating errors.

We leverage approaches for recognizing actions in video with deep networks, but apply them for video generation instead. We use spatio-temporal 3D convolutions to model videos [40], but we use fractionally strided convolutions [51] instead because we are interested in generation. We also use two streams to model video [34], but apply them for video generation instead of action recognition. However, our approach does not explicitly use optical flow; instead, we expect the network to learn motion features on its own. Finally, this paper is related to a growing body of work that capitalizes on large amounts of unlabeled video for visual recognition tasks [18, 46, 37, 13, 24, 25, 3, 32, 26, 27, 19, 41, 42, 1].
We instead leverage large amounts of unlabeled video for generation.

2 Generative Models for Video

In this section, we present a generative model for videos. We propose to use generative adversarial networks [9], which have been shown to have good performance on image generation [31, 4].

2.1 Review: Generative Adversarial Networks

The main idea behind generative adversarial networks [9] is to train two networks: a generator network G tries to produce a video, and a discriminator network D tries to distinguish between "real" videos and "fake" generated videos. One can train these networks against each other in a min-max game where the generator seeks to maximally fool the discriminator while simultaneously the discriminator seeks to detect which examples are fake:

min_{wG} max_{wD} E_{x~px(x)}[log D(x; wD)] + E_{z~pz(z)}[log(1 - D(G(z; wG); wD))]    (1)

where z is a latent "code" that is often sampled from a simple distribution (such as a normal distribution) and x ~ px(x) samples from the data distribution. In practice, since we do not know the true distribution of data px(x), we can estimate the expectation by drawing from our dataset.

Since we will optimize Equation 1 with gradient-based methods (SGD), the two networks G and D can take on any form appropriate for the task as long as they are differentiable with respect to parameters wG and wD. We design a G and D for video.

Figure 1: Video Generator Network: We illustrate our network architecture for the generator. The input is 100 dimensional (Gaussian noise). There are two independent streams: a moving foreground pathway of fractionally-strided spatio-temporal convolutions, and a static background pathway of fractionally-strided spatial convolutions, both of which up-sample. These two pathways are combined to create the generated video using a mask from the motion pathway.
Below each volume is its size and the number of channels in parentheses.

2.2 Generator Network

The input to the generator network is a low-dimensional latent code z ∈ R^d. In most cases, this code can be sampled from a distribution (e.g., Gaussian). Given a code z, we wish to produce a video.

We design the architecture of the generator network with a few principles in mind. Firstly, we want the network to be invariant to translations in both space and time. Secondly, we want a low-dimensional z to be able to produce a high-dimensional output (video). Thirdly, we want to assume a stationary camera and take advantage of the property that usually only objects move. We are interested in modeling object motion, and not the motion of cameras. Moreover, since modeling that the background is stationary is important in video recognition tasks [44], it may be helpful in video generation as well. We explore two different network architectures:

One Stream Architecture: We combine spatio-temporal convolutions [14, 40] with fractionally strided convolutions [51, 31] to generate video. Three-dimensional convolutions provide spatial and temporal invariance, while fractionally strided convolutions can upsample efficiently in a deep network, allowing z to be low-dimensional. We use an architecture inspired by [31], except extended in time. We use a five-layer network of 4 × 4 × 4 convolutions with a stride of 2, except for the first layer, which uses 2 × 4 × 4 convolutions (time × width × height). We found that these kernel sizes provided an appropriate balance between training speed and quality of generations.

Two Stream Architecture: The one-stream architecture does not model that the world is stationary and usually only objects move. We experimented with making this behavior explicit in the model. We use an architecture that enforces a static background and moving foreground.
We use a two-stream architecture where the generator is governed by the combination:

G2(z) = m(z) ⊙ f(z) + (1 - m(z)) ⊙ b(z).    (2)

Our intention is that 0 ≤ m(z) ≤ 1 can be viewed as a spatio-temporal mask that selects either the foreground f(z) model or the background model b(z) for each pixel location and timestep. To enforce a background model in the generations, b(z) produces a spatial image that is replicated over time, while f(z) produces a spatio-temporal cuboid masked by m(z). By summing the foreground model with the background model, we can obtain the final generation. Note that ⊙ is element-wise multiplication, and we replicate singleton dimensions to match the corresponding tensor. During learning, we also add to the objective a small sparsity prior on the mask, λ||m(z)||_1 for λ = 0.1, which we found helps encourage the network to use the background stream.

We use fractionally strided convolutional networks for m(z), f(z), and b(z). For f(z), we use the same network as the one-stream architecture, and for b(z) we use a similar generator architecture to [31]. We only use their architecture; we do not initialize with their learned weights. To create the mask m(z), we use a network that shares weights with f(z) except the last layer, which has only one output channel. We use a sigmoid activation function for the mask. We visualize the two-stream architecture in Figure 1.
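The combination in Equation 2 can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's Torch7 code: the function names, array layout, and toy sizes are ours, and the sigmoid/tanh activations mirror the mask and stream outputs described above.

```python
import numpy as np

def combine_streams(m, f, b):
    """Equation 2: G2(z) = m ⊙ f + (1 - m) ⊙ b.
    m: mask in [0, 1], shape (1, time, height, width); its singleton channel
       dimension broadcasts against the streams.
    f: foreground cuboid, shape (channels, time, height, width).
    b: static background image, shape (channels, height, width),
       replicated over the time axis before blending."""
    b_rep = np.repeat(b[:, None, :, :], m.shape[1], axis=1)
    return m * f + (1.0 - m) * b_rep

def mask_sparsity_penalty(m, lam=0.1):
    """Sparsity prior on the mask, lambda * ||m||_1, added to the objective
    to encourage use of the background stream (lambda = 0.1 in the paper)."""
    return lam * np.abs(m).sum()

# Toy sizes matching the paper's output: 32 frames of 64 x 64 pixels.
rng = np.random.default_rng(0)
m = 1.0 / (1.0 + np.exp(-rng.standard_normal((1, 32, 64, 64))))  # sigmoid mask
f = np.tanh(rng.standard_normal((3, 32, 64, 64)))                # foreground stream
b = np.tanh(rng.standard_normal((3, 64, 64)))                    # background image
video = combine_streams(m, f, b)
print(video.shape)  # (3, 32, 64, 64)
```

Because the blend is a per-pixel convex combination of two tanh outputs, the generated video stays in the same [-1, 1] range as the normalized training videos.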
In our experiments, the generator produces 64 × 64 videos for 32 frames, which is a little over a second.

2.3 Discriminator Network

The discriminator needs to be able to solve two problems: firstly, it must be able to classify realistic scenes from synthetically generated scenes, and secondly, it must be able to recognize realistic motion between frames. We chose to design the discriminator to solve both of these tasks with the same model. We use a five-layer spatio-temporal convolutional network with kernels 4 × 4 × 4 so that the hidden layers can learn both visual models and motion models. We design the architecture to be the reverse of the foreground stream in the generator, replacing fractionally strided convolutions with strided convolutions (to down-sample instead of up-sample), and replacing the last layer to output a binary classification (real or not).

2.4 Learning and Implementation

We train the generator and discriminator with stochastic gradient descent. We alternate between maximizing the loss w.r.t. wD and minimizing the loss w.r.t. wG for a fixed number of iterations. All networks are trained from scratch. Our implementation is based on a modified version of [31] in Torch7. We used a more numerically stable implementation of the cross entropy loss to prevent overflow. We use the Adam [16] optimizer with a fixed learning rate of 0.0002 and momentum term of 0.5. The latent code has 100 dimensions, which we sample from a normal distribution. We use a batch size of 64. We initialize all weights with zero-mean Gaussian noise with standard deviation 0.01. We normalize all videos to be in the range [-1, 1]. We use batch normalization [11] followed by ReLU activation functions after every layer in the generator, except the output layer, which uses tanh.
Following [31], we also use batch normalization in the discriminator except for the first layer, and we use leaky ReLU [48] instead of ReLU. Training typically took several days on a GPU.

3 Experiments

We experiment with the generative adversarial network for video (VGAN) on both generation and recognition tasks. We also show several qualitative examples online.

3.1 Unlabeled Video Dataset

We use a large amount of unlabeled video to train our model. We downloaded over two million videos from Flickr [39] by querying for popular Flickr tags as well as for common English words. From this pool, we created two datasets:

Unfiltered Unlabeled Videos: We use these videos directly, without any filtering, for representation learning. The dataset is over 5,000 hours.

Filtered Unlabeled Videos: To evaluate generations, we use the Places2 pre-trained model [53] to automatically filter the videos by scene category. Since image/video generation is a challenging problem, we assembled this dataset to better diagnose strengths and weaknesses of approaches. We experimented with four scene categories: golf courses, hospital rooms (babies), beaches, and train stations.

Stabilization: As we are interested in the movement of objects and not camera shake, we stabilize the camera motion for both datasets. We extract SIFT keypoints [21], use RANSAC to estimate a homography (rotation, translation, scale) between adjacent frames, and warp frames to minimize background motion. When the homography moves out of the frame, we fill in the missing values using the previous frames. If the homography has too large a re-projection error, we ignore that segment of the video for training, which only happened 3% of the time. The only other pre-processing we do is normalizing the videos to be in the range [-1, 1]. We extract frames at the native frame rate (25 fps).
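The per-segment filtering rule in the stabilization step can be sketched as follows. This is an illustrative NumPy fragment: the SIFT extraction and RANSAC estimation themselves are omitted, the function names are ours, and the 4-pixel error threshold is an assumption, not a value from the paper.

```python
import numpy as np

def reprojection_error(H, src_pts, dst_pts):
    """Mean re-projection error (in pixels) of matched keypoints under a
    3x3 homography H mapping src_pts to dst_pts (both (N, 2) arrays)."""
    ones = np.ones((src_pts.shape[0], 1))
    proj = np.hstack([src_pts, ones]) @ H.T   # apply H in homogeneous coords
    proj = proj[:, :2] / proj[:, 2:3]         # back to inhomogeneous coords
    return float(np.linalg.norm(proj - dst_pts, axis=1).mean())

def keep_segment(H, src_pts, dst_pts, max_error=4.0):
    """Drop a video segment from training when the estimated homography has
    too large a re-projection error (threshold chosen for illustration)."""
    return reprojection_error(H, src_pts, dst_pts) <= max_error

# An identity homography reprojects keypoints exactly, so the segment is kept.
pts = np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]])
print(keep_segment(np.eye(3), pts, pts))  # True
```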
We use 32-frame videos of spatial resolution 64 × 64.

Figure 2: Video Generations: We show some generations from the two-stream model. The red arrows highlight motions. Please see http://mit.edu/vondrick/tinyvideo for animated movies.

3.2 Video Generation

We evaluate both the one-stream and two-stream generators. We trained a generator for each scene category in our filtered dataset. We perform both a qualitative evaluation as well as a quantitative psychophysical evaluation to measure the perceptual quality of the generated videos.

Qualitative Results: We show several examples of the videos generated from our model in Figure 2. We observe that a) the generated scenes tend to be fairly sharp and that b) the motion patterns are generally correct for their respective scene. For example, the beach model tends to produce beaches with crashing waves, the golf model produces people walking on grass, and the train station generations usually show train tracks and a train with windows rapidly moving along it. While the model usually learns to put motion on the right objects, one common failure mode is that the objects lack resolution. For example, the people in the beaches and golf courses are often blobs. Nevertheless, we believe it is promising that our model can generate short motions. We visualize the behavior of the two-stream architecture in Figure 3.

Baseline: Since to our knowledge there are no existing large-scale generative models of video ([33] requires an input frame), we develop a simple but reasonable baseline for this task. We train an autoencoder over our data. The encoder is similar to the discriminator network (except producing a 100 dimensional code), while the decoder follows the two-stream generator network. Hence, the baseline autoencoder network has a similar number of parameters as our full approach.
We then feed examples through the encoder and fit a Gaussian Mixture Model (GMM) with 256 components over the 100 dimensional hidden space. To generate a novel video, we sample from this GMM, and feed the sample through the decoder.

Evaluation Metric: We quantitatively evaluate our generations using a psychophysical two-alternative forced choice with workers on Amazon Mechanical Turk. We show a worker two random videos, and ask them "Which video is more realistic?" We collected over 13,000 opinions across 150 unique workers. We paid workers one cent per comparison, and required workers to historically have a 95% approval rating on MTurk. We experimented with removing bad workers that frequently said real videos were not realistic, but the relative rankings did not change. We designed this experiment following advice from [38], which advocates evaluating generative models for the task at hand. In our case, we are interested in the perceptual quality of motion. We consider a model X better than model Y if workers prefer generations from X more often than generations from Y.

[Figure 2 panels: generated videos (frames 1, 16, 32) for beach, golf course, train station, and hospital/baby scenes.]

Figure 3: Streams: We visualize the background, foreground, and masks for beaches (left) and golf (right). The network generally learns to disentangle the foreground from the background.

"Which video is more realistic?" (percentage of trials)    Golf  Beach  Train  Baby  Mean
Random Preference                                           50    50     50    50    50
Prefer VGAN Two Stream over Autoencoder                     88    83     87    71    82
Prefer VGAN One Stream over Autoencoder                     85    88     85    73    82
Prefer VGAN Two Stream over VGAN One Stream                 55    58     47    52    53
Prefer VGAN Two Stream over Real                            21    23     23     6    18
Prefer VGAN One Stream over Real                            17    21     19     8    16
Prefer Autoencoder over Real                                 4     2      4     2     3

Table 1: Video Generation Preferences: We show two videos to workers on Amazon Mechanical Turk, and ask them to choose which video is more realistic. The table shows the percentage of times that workers prefer generations from one model over another. In all cases, workers tend to prefer video generative adversarial networks over an autoencoder. In most cases, workers show a slight preference for the two-stream model.

Quantitative Results: Table 1 shows the percentage of times that workers preferred generations from one model over another. Workers consistently prefer videos from the generative adversarial network more than the autoencoder. Additionally, workers show a slight preference for the two-stream architecture, especially in scenes where the background is large (e.g., golf course, beach). Although the one-stream architecture is capable of generating stationary backgrounds, it may be difficult to find this solution, motivating a more explicit architecture. The one-stream architecture generally produces high-frequency temporal flickering in the background. To evaluate whether static frames are better than our generations, we also asked workers to choose between our videos and a static frame, and workers only chose the static frame 38% of the time, suggesting our model produces more realistic motion than static frames on average.
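The two-alternative forced choice reduces to counting which model wins each pairwise trial. A minimal sketch of the aggregation (the vote counts below are made up for illustration; only the 82% two-stream-over-autoencoder mean from Table 1 is echoed):

```python
from collections import Counter

def preference_percentages(votes):
    """Aggregate 2AFC votes. `votes` is a list of (model_a, model_b, winner)
    tuples, one per comparison shown to a worker; returns, for each ordered
    pair, the percentage of trials in which the first model was preferred."""
    shown = Counter()
    won = Counter()
    for a, b, winner in votes:
        shown[(a, b)] += 1
        if winner == a:
            won[(a, b)] += 1
    return {pair: 100.0 * won[pair] / n for pair, n in shown.items()}

# Hypothetical trials: 82 of 100 workers prefer the two-stream generations.
votes = [("two_stream", "autoencoder", "two_stream")] * 82 + \
        [("two_stream", "autoencoder", "autoencoder")] * 18
prefs = preference_percentages(votes)
print(prefs[("two_stream", "autoencoder")])  # 82.0
```

Under the paper's criterion, model X is considered better than model Y whenever this percentage exceeds 50.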
Finally, while workers generally can distinguish real videos from generated videos, the workers show the most confusion with our two-stream model compared to baselines, suggesting the two-stream generations may be more realistic on average.

3.3 Video Representation Learning

We also experimented with using our model as a way to learn unsupervised representations for video. We train our two-stream model with over 5,000 hours of unfiltered, unlabeled videos from Flickr. We then fine-tune the discriminator on the task of interest (e.g., action recognition) using a relatively small set of labeled video. To do this, we replace the last layer (which is a binary classifier) with a K-way softmax classifier. We also add dropout [36] to the penultimate layer to reduce overfitting.

Action Classification: We evaluated performance on classifying actions on UCF101 [35]. We report accuracy in Figure 4a. Initializing the network with the weights learned from the generative adversarial network outperforms a randomly initialized network, suggesting that it has learned a useful internal representation for video. Interestingly, while a randomly initialized network under-performs hand-crafted STIP features [35], the network initialized with our model significantly outperforms it. We also experimented with training a logistic regression on only the last layer, which performed worse. Finally, our model slightly outperforms another recent unsupervised video representation learning approach [24]. However, our approach uses an order of magnitude fewer parameters, fewer layers (5 layers vs 8 layers), and low-resolution video.

Method                      Accuracy
Chance                      0.9%
STIP Features [35]          43.9%
Temporal Coherence [10]     45.4%
Shuffle and Learn [24]      50.2%
VGAN + Random Init          36.7%
VGAN + Logistic Reg         49.3%
VGAN + Fine Tune            52.1%
ImageNet Supervision [45]   91.4%
(a) Accuracy with Unsupervised Methods    (b) Performance vs # Data    (c) Relative Gain vs # Data

Figure 4: Video Representation Learning: We evaluate the representation learned by the discriminator for action classification on UCF101 [35]. (a) By fine-tuning the discriminator on a relatively small labeled dataset, we can obtain better performance than random initialization, and better than hand-crafted space-time interest point (STIP) features. Moreover, our model slightly outperforms another unsupervised video representation [24] despite using an order of magnitude fewer learned parameters and only 64 × 64 videos. Note unsupervised video representations are still far from models that leverage external supervision. (b) Our unsupervised representation with less labeled data outperforms random initialization with all the labeled data. Our results suggest that, with just 1/8th of the labeled data, we can match the performance of a randomly initialized network that used all of the labeled data. (c) The fine-tuned model has a larger relative gain over random initialization in cases with less labeled data. Note that (a) is over all train/test splits of UCF101, while (b, c) are over the first split in order to make the experiments less expensive.

Performance vs Data: We also experimented with varying the amount of labeled training data available to our fine-tuned network. Figure 4b reports performance versus the amount of labeled training data available. As expected, performance increases with more labeled data. The fine-tuned model shows an advantage in low data regimes: even with one eighth of the labeled data, the fine-tuned model still beats a randomly initialized network.
Moreover, Figure 4c plots the relative accuracy gain of the fine-tuned model over the random initialization (fine-tuned performance divided by randomly initialized performance). This shows that fine-tuning with our model has a larger relative gain over random initialization in cases with less labeled data, showing its utility in low-data regimes.

3.4 Future Generation

We investigate whether our approach can be used to generate the future of a static image. Specifically, given a static image x0, can we extrapolate a video of possible consequent frames?

Encoder: We utilize the same model as our two-stream model, however we must make one change in order to input the static image instead of the latent code. We do this by attaching a five-layer convolutional network to the front of the generator which encodes the image into the latent space, similar to a conditional generative adversarial network [23]. The rest of the generator and discriminator networks remain the same. However, we add an additional loss term that minimizes the distance between the input and the first frame of the generated video. We do this so that the generator creates videos consistent with the input image. We train from scratch with the objective:

min_{wG} max_{wD} E_{x~px(x)}[log D(x; wD)] + E_{x0~px0(x0)}[log(1 - D(G(x0; wG); wD))] + E_{x0~px0(x0)}[λ ||x0 - G0(x0; wG)||_2^2]    (3)

where x0 is the first frame of the input, G0(·) is the first frame of the generated video, and λ ∈ R is a hyperparameter. The discriminator will try to classify realistic frames and realistic motions as before, while the generator will try to produce a realistic video such that the first frame is reconstructed well.

Results: We qualitatively show a few examples of our approach in Figure 5 using held-out testing videos.
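The extra reconstruction term in Equation 3 can be sketched in NumPy. This is an illustrative fragment, not the paper's code: the function name, array layout, and the value of λ are our assumptions, and the adversarial terms are omitted.

```python
import numpy as np

def first_frame_reconstruction_loss(x0, generated_video, lam=1.0):
    """Reconstruction term of Equation 3: lambda * ||x0 - G0(x0)||_2^2,
    where G0 is the first frame of the generated video. Arrays follow a
    (channels, time, height, width) layout; lam is an illustrative value,
    as the paper leaves the hyperparameter unspecified here."""
    g0 = generated_video[:, 0]                 # first generated frame
    return lam * float(((x0 - g0) ** 2).sum())

# When the generated first frame matches the input exactly, the term is zero.
rng = np.random.default_rng(0)
video = np.tanh(rng.standard_normal((3, 32, 8, 8)))
x0 = video[:, 0].copy()
print(first_frame_reconstruction_loss(x0, video))  # 0.0
```

In training, this term is simply added to the min-max adversarial objective, so the generator trades off fooling the discriminator against reproducing the conditioning frame.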
[Figure 4 plots: (b) accuracy (percentage) vs. number of labeled training videos for VGAN Init, Random Init, and Chance; (c) relative accuracy gain vs. number of labeled training videos.]

Figure 5: Future Generation: We show one application of generative video models where we predict videos given a single static image. The red arrows highlight regions of motion. Since this is an ambiguous task, our model usually does not generate the correct video, however the generation is often plausible. Please see http://mit.edu/vondrick/tinyvideo for animated movies.

(a) hidden unit that fires on "person"    (b) hidden unit that fires on "train tracks"

Figure 6: Visualizing Representation: We visualize some hidden units in the encoder of the future generator, following the technique from [52]. We highlight regions of images that a particular convolutional hidden unit maximally activates on. While not all units are semantic, some units activate on objects that are sources for motion, such as people and train tracks.

Although the extrapolations are rarely correct, they often have fairly plausible motions. The most common failure is that the generated video has a scene similar but not identical to the input image, such as by changing colors or dropping/hallucinating objects. The former could be solved by a color histogram normalization in post-processing (which we did not do for simplicity), while we suspect the latter will require building more powerful generative models. The generated videos are usually not the correct video, but we observe that often the motions are plausible. We are not aware of an existing approach that can directly generate multi-frame videos from a single static image. [33, 22] can generate video, but they require multiple input frames and empirically become blurry after extrapolating many frames.
[43, 50] can predict optical flow from a single image, but they do not generate several frames of motion and may be susceptible to warping artifacts. We believe this experiment shows an important application of generative video models.

Visualizing Representation: Since generating the future requires understanding how objects move, the network may need to learn to recognize some objects internally, even though it is not supervised to do so. Figure 6 visualizes some activations of hidden units in the third convolutional layer. While not all units are semantic, some of the units tend to be selective for objects that are sources of motion, such as people or train tracks. These visualizations suggest that scaling up future generation might be a promising supervisory signal for object recognition and complementary to [27, 5, 46].

Conclusion: Understanding scene dynamics will be crucial for the next generation of computer vision systems. In this work, we explored how to learn some dynamics from large amounts of unlabeled video by capitalizing on adversarial learning methods. Since annotating dynamics is expensive, we believe learning from unlabeled data is a promising direction. While we are still a long way from fully harnessing the potential of unlabeled video, our experiments support that abundant unlabeled video can be lucrative for both learning to generate videos and learning visual representations.

Acknowledgements: We thank Yusuf Aytar for dataset discussions. We thank MIT TIG, especially Garrett Wollman, for troubleshooting issues on storing the 26 TB of video. We are grateful to the Torch7 community for answering many questions. NVidia donated GPUs used for this research.
This work was supported by NSF grant #1524817 to AT, the START program at UMBC to HP, and the Google PhD fellowship to CV.

References

[1] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. Learning sound representations from unlabeled video. In NIPS, 2016.
[2] Tali Basha, Yael Moses, and Shai Avidan. Photo sequencing. In ECCV, 2012.
[3] Chao-Yeh Chen and Kristen Grauman. Watching unlabeled video helps learn new human actions from very few labeled snapshots. In CVPR, 2013.
[4] Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, 2015.
[5] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
[6] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. arXiv, 2016.
[7] József Fiser and Richard N Aslin. Statistical learning of higher-order temporal structure from visual shape sequences. JEP, 2002.
[8] Katerina Fragkiadaki, Sergey Levine, Panna Felsen, and Jitendra Malik. Recurrent network models for human dynamics. In ICCV, 2015.
[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
[10] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
[11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv, 2015.
[12] Phillip Isola, Joseph J Lim, and Edward H Adelson. Discovering states and transformations in image collections. In CVPR, 2015.
[13] Dinesh Jayaraman and Kristen Grauman. Learning image representations tied to ego-motion. In ICCV, 2015.
[14] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. PAMI, 2013.
[15] Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. arXiv, 2016.
[16] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv, 2014.
[17] Kris Kitani, Brian Ziebart, James Bagnell, and Martial Hebert. Activity forecasting. In ECCV, 2012.
[18] Quoc V Le. Building high-level features using large scale unsupervised learning. In ICASSP, 2013.
[19] Yin Li, Manohar Paluri, James M Rehg, and Piotr Dollár. Unsupervised learning of edges. arXiv, 2015.
[20] William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. arXiv, 2016.
[21] David G Lowe. Object recognition from local scale-invariant features. In ICCV, 1999.
[22] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv, 2015.
[23] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv, 2014.
[24] Ishan Misra, C. Lawrence Zitnick, and Martial Hebert. Shuffle and Learn: Unsupervised learning using temporal order verification. In ECCV, 2016.
[25] Hossein Mobahi, Ronan Collobert, and Jason Weston. Deep learning from temporal coherence in video. In ICML, 2009.
[26] Phuc Xuan Nguyen, Gregory Rogez, Charless Fowlkes, and Deva Ramanan. The open world of micro-videos. arXiv, 2016.
[27] Andrew Owens, Jiajun Wu, Josh H McDermott, William T Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. arXiv, 2016.
[28] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. arXiv, 2016.
[29] Nikola Petrovic, Aleksandar Ivanovic, and Nebojsa Jojic. Recursive estimation of generative models of video. In CVPR, 2006.
[30] Lyndsey Pickup, Zheng Pan, Donglai Wei, YiChang Shih, Changshui Zhang, Andrew Zisserman, Bernhard Scholkopf, and William Freeman. Seeing the arrow of time. In CVPR, 2014.
[31] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv, 2015.
[32] Vignesh Ramanathan, Kevin Tang, Greg Mori, and Li Fei-Fei. Learning temporal embeddings for complex video analysis. In CVPR, 2015.
[33] MarcAurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv, 2014.
[34] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[35] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv, 2012.
[36] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
[37] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. arXiv, 2015.
[38] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv, 2015.
[39] Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 2016.
[40] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks.
arXiv, 2014.\n\n[41] Carl Vondrick, Donald Patterson, and Deva Ramanan. Ef\ufb01ciently scaling up crowdsourced video annotation. IJCV, 2013.\n[42] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating visual representations from unlabeled video. CVPR, 2015.\n[43] Jacob Walker, Arpan Gupta, and Martial Hebert. Patch to the future: Unsupervised visual prediction. In CVPR, 2014.\n[44] Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In ICCV, 2013.\n[45] Limin Wang, Yuanjun Xiong, Zhe Wang, and Yu Qiao. Towards good practices for very deep two-stream convnets. arXiv, 2015.\n[46] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.\n[47] Xiaolong Wang and Abhinav Gupta. Generative image modeling using style and structure adversarial networks. arXiv, 2016.\n[48] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of recti\ufb01ed activations in convolutional network. arXiv, 2015.\n[49] Tianfan Xue, Jiajun Wu, Katherine L Bouman, and William T Freeman. Visual dynamics: Probabilistic future frame synthesis via cross\n\nconvolutional networks. arXiv, 2016.\n\n[50] Jenny Yuen and Antonio Torralba. A data-driven approach for event prediction. In ECCV. 2010.\n[51] Matthew D Zeiler, Dilip Krishnan, Graham W Taylor, and Rob Fergus. Deconvolutional networks. In CVPR, 2010.\n[52] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene cnns. arXiv,\n\n[53] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using\n\n2014.\n\nplaces database. In NIPS, 2014.\n\n[54] Yipin Zhou and Tamara L Berg. Temporal perception and prediction in ego-centric video. In ICCV, 2015.\n[55] Yipin Zhou and Tamara L Berg. Learning temporal transformations from time-lapse videos. 
In ECCV, 2016.\n\n9\n\n\f", "award": [], "sourceid": 339, "authors": [{"given_name": "Carl", "family_name": "Vondrick", "institution": "MIT"}, {"given_name": "Hamed", "family_name": "Pirsiavash", "institution": "University of Maryland, Baltimore County"}, {"given_name": "Antonio", "family_name": "Torralba", "institution": "MIT"}]}