{"title": "PredRNN: Recurrent Neural Networks for Predictive Learning using Spatiotemporal LSTMs", "book": "Advances in Neural Information Processing Systems", "page_first": 879, "page_last": 888, "abstract": "The predictive learning of spatiotemporal sequences aims to generate future images by learning from the historical frames, where spatial appearances and temporal variations are two crucial structures. This paper models these structures by presenting a predictive recurrent neural network (PredRNN). This architecture is enlightened by the idea that spatiotemporal predictive learning should memorize both spatial appearances and temporal variations in a unified memory pool. Concretely, memory states are no longer constrained inside each LSTM unit. Instead, they are allowed to zigzag in two directions: across stacked RNN layers vertically and through all RNN states horizontally. The core of this network is a new Spatiotemporal LSTM (ST-LSTM) unit that extracts and memorizes spatial and temporal representations simultaneously. PredRNN achieves the state-of-the-art prediction performance on three video prediction datasets and is a more general framework that can be easily extended to other predictive learning tasks by integrating with other architectures.", "full_text": "PredRNN: Recurrent Neural Networks for Predictive Learning using Spatiotemporal LSTMs\n\nYunbo Wang\nSchool of Software\nTsinghua University\nwangyb15@mails.tsinghua.edu.cn\n\nMingsheng Long∗\nSchool of Software\nTsinghua University\nmingsheng@tsinghua.edu.cn\n\nJianmin Wang\nSchool of Software\nTsinghua University\njimwang@tsinghua.edu.cn\n\nZhifeng Gao\nSchool of Software\nTsinghua University\ngzf16@mails.tsinghua.edu.cn\n\nPhilip S.
Yu\nSchool of Software\nTsinghua University\npsyu@uic.edu\n\nAbstract\n\nThe predictive learning of spatiotemporal sequences aims to generate future images by learning from the historical frames, where spatial appearances and temporal variations are two crucial structures. This paper models these structures by presenting a predictive recurrent neural network (PredRNN). This architecture is enlightened by the idea that spatiotemporal predictive learning should memorize both spatial appearances and temporal variations in a unified memory pool. Concretely, memory states are no longer constrained inside each LSTM unit. Instead, they are allowed to zigzag in two directions: across stacked RNN layers vertically and through all RNN states horizontally. The core of this network is a new Spatiotemporal LSTM (ST-LSTM) unit that extracts and memorizes spatial and temporal representations simultaneously. PredRNN achieves the state-of-the-art prediction performance on three video prediction datasets and is a more general framework that can be easily extended to other predictive learning tasks by integrating with other architectures.\n\n1 Introduction\n\nAs a key application of predictive learning, generating images conditioned on given consecutive frames has received growing interest in the machine learning and computer vision communities. To learn representations of spatiotemporal sequences, recurrent neural networks (RNNs) [17, 27] with Long Short-Term Memory (LSTM) [9] have recently been extended from supervised sequence learning tasks, such as machine translation [22, 2], speech recognition [8], action recognition [28, 5] and video captioning [5], to this spatiotemporal predictive learning scenario [21, 16, 19, 6, 25, 12].\n\n1.1 Why spatiotemporal memory?\n\nIn spatiotemporal predictive learning, there are two crucial aspects: spatial correlations and temporal dynamics.
The performance of a prediction system depends on whether it is able to memorize relevant structures. However, to the best of our knowledge, the state-of-the-art RNN/LSTM predictive learning methods [19, 21, 6, 12, 25] focus more on modeling temporal variations (such as object moving trajectories), with memory states being updated repeatedly over time inside each LSTM unit. Admittedly, the stacked LSTM architecture has proved powerful for supervised spatiotemporal learning\n\n∗Corresponding author: Mingsheng Long\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n(such as video action recognition [5, 28]). Two conditions are met in this scenario: (1) temporal features are strong enough for classification tasks, whereas fine-grained spatial appearances prove to be less significant; (2) there are no complex visual structures to be modeled in the expected outputs, so spatial representations can be highly abstracted. However, spatiotemporal predictive learning does not satisfy these conditions. Here, spatial deformations and temporal dynamics are equally significant for generating future frames. A straightforward idea is that if we hope to foretell the future, we need to memorize as many historical details as possible. When we recall something that happened before, we do not just recall object movements, but also recollect visual appearances from coarse to fine. Motivated by this, we present a new recurrent architecture called Predictive RNN (PredRNN), which allows memory states belonging to different LSTMs to interact across layers (in conventional RNNs, they are mutually independent). As the key component of PredRNN, we design a novel Spatiotemporal LSTM (ST-LSTM) unit.
It models spatial and temporal representations in a unified memory cell and conveys the memory both vertically across layers and horizontally over states. PredRNN achieves the state-of-the-art prediction results on three video datasets. It is a general and modular framework for predictive learning and is not limited to video prediction.\n\n1.2 Related work\n\nRecent advances in recurrent neural network models provide some useful insights on how to predict future visual sequences based on historical observations. Ranzato et al. [16] defined an RNN architecture inspired by language modeling, predicting the frames in a discrete space of patch clusters. Srivastava et al. [21] adapted the sequence-to-sequence LSTM framework. Shi et al. [19] extended this model to further extract visual representations by exploiting convolutions in both input-to-state and state-to-state transitions. This Convolutional LSTM (ConvLSTM) model has become a seminal work in this area. Subsequently, Finn et al. [6] constructed a network based on ConvLSTMs that predicts transformations on the input pixels for next-frame prediction. Lotter et al. [12] presented a deep predictive coding network where each ConvLSTM layer outputs a layer-specific prediction at each time step and produces an error term, which is then propagated laterally and vertically in the network. However, in their setting, the predicted next frame is always based on the whole previous ground truth sequence. By contrast, we predict a sequence from a sequence, which is obviously more challenging. Patraucean et al. [15] and Villegas et al. [25] brought optical flow into RNNs to model short-term temporal dynamics, inspired by the two-stream CNNs [20] designed for action recognition. However, optical flow images are hard to use, since they bring in high additional computational costs and reduce prediction efficiency. Kalchbrenner et al.
[10] proposed a Video Pixel Network (VPN) that estimates the discrete joint distribution of the raw pixel values in a video using the well-established PixelCNNs [24], but it suffers from high computational complexity. Besides the above RNN architectures, other deep architectures have been applied to the visual predictive learning problem. Oh et al. [14] defined a CNN-based action-conditional autoencoder model to predict next frames in Atari games. Mathieu et al. [13] successfully employed generative adversarial networks [7, 4] to preserve the sharpness of the predicted frames. In summary, these existing visual prediction models exhibit different shortcomings with different causes. The RNN-based architectures [21, 16, 19, 6, 25, 12] model temporal structures with LSTMs, but their predicted images tend to blur due to a loss of fine-grained visual appearances. In contrast, CNN-based networks [13, 14] predict one frame at a time and generate future images recursively, which makes them prone to focusing on spatial appearances and relatively weak at capturing long-term motions. In this paper, we explore a new RNN framework for predictive learning and present a novel LSTM unit for memorizing spatiotemporal information simultaneously.\n\n2 Preliminaries\n\n2.1 Spatiotemporal predictive learning\n\nSuppose we are monitoring a dynamical system (e.g. a video clip) of P measurements over time, where each measurement (e.g. an RGB channel) is recorded at all locations in a spatial region represented by an M × N grid (e.g. video frames). From the spatial view, the observation of these P measurements at any time can be represented by a tensor X ∈ R^{P×M×N}. From the temporal view, the observations over T time steps form a sequence of tensors X_1, X_2, ..., X_T.
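As a concrete illustration of this formulation, the sketch below builds such a tensor sequence and splits it into a length-J context and a length-K prediction target. The shapes are a hypothetical example (they mirror the Moving MNIST setting used later), not a prescribed API.

```python
import numpy as np

# Hypothetical example: a "video" of T = 20 time steps, where each observation
# X_t is a tensor in R^{P x M x N} -- P measurements (e.g. 3 RGB channels)
# on an M x N spatial grid (e.g. 64 x 64 pixels).
T, P, M, N = 20, 3, 64, 64
sequence = np.random.rand(T, P, M, N).astype(np.float32)

# Split into a length-J context and a length-K prediction target,
# matching the J -> K formulation of Eq. (1).
J, K = 10, 10
context, target = sequence[:J], sequence[J:J + K]
print(context.shape, target.shape)  # (10, 3, 64, 64) (10, 3, 64, 64)
```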
The spatiotemporal predictive learning problem is to predict the most probable length-K sequence in the future given the previous length-J sequence including the current observation:\n\n\hat{X}_{t+1}, ..., \hat{X}_{t+K} = arg max_{X_{t+1},...,X_{t+K}} p(X_{t+1}, ..., X_{t+K} | X_{t−J+1}, ..., X_t).   (1)\n\nSpatiotemporal predictive learning is an important problem, which could find crucial and high-impact applications in various domains: video prediction and surveillance, meteorological and environmental forecasting, energy and smart grid management, economics and finance prediction, etc. Taking video prediction as an example, the measurements are the three RGB channels, and the observation at each time step is a 3D video frame of an RGB image. Another example is radar-based precipitation forecasting, where the measurement is the radar echo value and the observation at every time step is a 2D radar echo map that can be visualized as an RGB image.\n\n2.2 Convolutional LSTM\n\nCompared with standard LSTMs, the convolutional LSTM (ConvLSTM) [19] is able to model spatiotemporal structures simultaneously by explicitly encoding the spatial information into tensors, overcoming the limitation of vector-variate representations in standard LSTM, where the spatial information is lost. In ConvLSTM, all the inputs X_1, ..., X_t, cell outputs C_1, ..., C_t, hidden states H_1, ..., H_t, and gates i_t, f_t, g_t, o_t are 3D tensors in R^{P×M×N}, where the first dimension is either the number of measurements (for inputs) or the number of feature maps (for intermediate representations), and the last two dimensions are spatial dimensions (M rows and N columns). To get a better picture of the inputs and states, we may imagine them as vectors standing on a spatial grid. ConvLSTM determines the future state of a certain cell in the M × N grid by the inputs and past states of its local neighbors. This can easily be achieved by using convolution operators in the state-to-state and input-to-state transitions. The key equations of ConvLSTM are shown as follows:\n\ng_t = tanh(W_{xg} ∗ X_t + W_{hg} ∗ H_{t−1} + b_g)\ni_t = σ(W_{xi} ∗ X_t + W_{hi} ∗ H_{t−1} + W_{ci} ⊙ C_{t−1} + b_i)\nf_t = σ(W_{xf} ∗ X_t + W_{hf} ∗ H_{t−1} + W_{cf} ⊙ C_{t−1} + b_f)\nC_t = f_t ⊙ C_{t−1} + i_t ⊙ g_t\no_t = σ(W_{xo} ∗ X_t + W_{ho} ∗ H_{t−1} + W_{co} ⊙ C_t + b_o)\nH_t = o_t ⊙ tanh(C_t),   (2)\n\nwhere σ is the sigmoid activation function, and ∗ and ⊙ denote the convolution operator and the Hadamard product respectively. If the states are viewed as the hidden representations of moving objects, then a ConvLSTM with a larger transitional kernel should be able to capture faster motions, while one with a smaller kernel can capture slower motions [19]. The use of the input gate i_t, forget gate f_t, output gate o_t, and input-modulation gate g_t controls information flow across the memory cell C_t. In this way, the gradient is prevented from vanishing quickly by being trapped in the memory.\nThe ConvLSTM network adopts the encoder-decoder RNN architecture that was proposed in [23] and extended to video prediction in [21]. For a 4-layer ConvLSTM encoder-decoder network, input frames are fed into the first layer and the future video sequence is generated at the fourth one. In this process, spatial representations are encoded layer by layer, with hidden states being delivered from bottom to top. However, the memory cells that belong to these four layers are mutually independent and updated merely in the time domain. Under these circumstances, the bottom layer would totally ignore what had been memorized by the top layer at the previous time step.
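One step of the gated ConvLSTM update in Eq. (2) can be sketched as follows. This is a minimal illustration, not the authors' implementation: for brevity the convolutions are assumed to be 1 × 1, so each W ∗ X collapses to a per-pixel linear map, and all weights are random.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Minimal sketch of one ConvLSTM step (Eq. 2), assuming 1x1 convolutions so
# that W * X reduces to a per-pixel linear map via einsum; real ConvLSTMs use
# larger spatial kernels for the input-to-state and state-to-state transitions.
def convlstm_step(X, H_prev, C_prev, W, b):
    conv = lambda w, x: np.einsum('oi,imn->omn', w, x)  # 1x1 "convolution"
    g = np.tanh(conv(W['xg'], X) + conv(W['hg'], H_prev) + b['g'])
    i = sigmoid(conv(W['xi'], X) + conv(W['hi'], H_prev) + W['ci'] * C_prev + b['i'])
    f = sigmoid(conv(W['xf'], X) + conv(W['hf'], H_prev) + W['cf'] * C_prev + b['f'])
    C = f * C_prev + i * g                               # memory cell update
    o = sigmoid(conv(W['xo'], X) + conv(W['ho'], H_prev) + W['co'] * C + b['o'])
    H = o * np.tanh(C)                                   # hidden state
    return H, C

P, D, M, N = 3, 8, 16, 16   # input channels, hidden channels, grid size
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((D, P if k[0] == 'x' else D)) * 0.1
     for k in ['xg', 'xi', 'xf', 'xo', 'hg', 'hi', 'hf', 'ho']}
W.update({k: rng.standard_normal((D, M, N)) * 0.1 for k in ['ci', 'cf', 'co']})
b = {k: np.zeros((D, 1, 1)) for k in 'gifo'}
H, C = convlstm_step(rng.standard_normal((P, M, N)),
                     np.zeros((D, M, N)), np.zeros((D, M, N)), W, b)
print(H.shape, C.shape)  # (8, 16, 16) (8, 16, 16)
```

Note how every state keeps its M × N spatial layout, which is exactly what distinguishes ConvLSTM from the vector-variate standard LSTM.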
Overcoming the drawbacks of this layer-independent memory mechanism is important to the predictive learning of video sequences.\n\n3 PredRNN\n\nIn this section, we give detailed descriptions of the predictive recurrent neural network (PredRNN). Initially, this architecture is enlightened by the idea that a predictive learning system should memorize both spatial appearances and temporal variations in a unified memory pool. By doing this, we make the memory states flow through the whole network along a zigzag direction. Then, we go a step further to see how to make the spatiotemporal memory interact with the original long short-term memory. Thus we explore the memory cell, memory gate and memory fusion mechanisms inside LSTMs/ConvLSTMs. We finally derive a novel Spatiotemporal LSTM (ST-LSTM) unit for PredRNN, which is able to deliver memory states both vertically and horizontally.\n\n3.1 Spatiotemporal memory flow\n\nFigure 1: Left: the convolutional LSTM network with a spatiotemporal memory flow. Right: the conventional ConvLSTM architecture. The orange arrows denote the memory flow direction for all memory cells.\n\nFor generating spatiotemporal predictions, we initially exploit convolutional LSTMs (ConvLSTMs) [19] as basic building blocks. Stacked ConvLSTMs extract highly abstract features layer by layer and then make predictions by mapping them back to the pixel-value space. In the conventional ConvLSTM architecture, as illustrated in Figure 1 (right), the cell states are constrained inside each ConvLSTM layer and are updated only horizontally. Information is conveyed upwards only by hidden states. Such a temporal memory flow is reasonable in supervised learning, because, according to the study of stacked convolutional layers, the hidden representations can become more and more abstract and class-specific from the bottom layer upwards.
However, we suppose that in predictive learning, detailed information in the raw input sequence should be maintained. If we want to see into the future, we need to learn from representations extracted at different-level convolutional layers. Thus, we apply a unified spatiotemporal memory pool and alter the RNN connections as illustrated in Figure 1 (left). The orange arrows denote the feed-forward directions of LSTM memory cells. In the left figure, a unified memory is shared by all LSTMs and is updated along a zigzag direction. The key equations of the convolutional LSTM unit with a spatiotemporal memory flow are shown as follows:\n\ng_t = tanh(W_{xg} ∗ X_t 1_{l=1} + W_{hg} ∗ H^{l−1}_t + b_g)\ni_t = σ(W_{xi} ∗ X_t 1_{l=1} + W_{hi} ∗ H^{l−1}_t + W_{mi} ⊙ M^{l−1}_t + b_i)\nf_t = σ(W_{xf} ∗ X_t 1_{l=1} + W_{hf} ∗ H^{l−1}_t + W_{mf} ⊙ M^{l−1}_t + b_f)\nM^l_t = f_t ⊙ M^{l−1}_t + i_t ⊙ g_t\no_t = σ(W_{xo} ∗ X_t 1_{l=1} + W_{ho} ∗ H^{l−1}_t + W_{mo} ⊙ M^l_t + b_o)\nH^l_t = o_t ⊙ tanh(M^l_t).   (3)\n\nThe input gate, input-modulation gate, forget gate and output gate no longer depend on the hidden states and cell states from the previous time step at the same layer. Instead, as illustrated in Figure 1 (left), they rely on the hidden states H^{l−1}_t and cell states M^{l−1}_t (l ∈ {1, ..., L}) that are updated by the previous layer at the current time step. Specifically, the bottom LSTM unit receives state values from the top layer at the previous time step: for l = 1, H^{l−1}_t = H^L_{t−1} and M^{l−1}_t = M^L_{t−1}. The four layers in this figure have different sets of input-to-state and state-to-state convolutional parameters, while they maintain a spatiotemporal memory cell and update its states separately and repeatedly as the information flows through the current node.
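The zigzag flow of Eq. (3) can be sketched as below. This is a simplified illustration, not the paper's implementation: the spatial grid is collapsed to vectors, convolutions become dense matrices, the peephole-style terms W_{mi} ⊙ M and W_{mf} ⊙ M are approximated by adding M directly, and the name `zigzag_forward` is ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simplified sketch of the zigzag memory flow in Eq. (3): one shared memory M
# travels bottom-to-top within a time step, then jumps from the top layer back
# to the bottom layer of the next time step.
def zigzag_forward(X_seq, L, D, rng):
    Wg, Wi, Wf, Wo = (rng.standard_normal((L, D, D)) * 0.1 for _ in range(4))
    Wx = rng.standard_normal((D, X_seq.shape[1])) * 0.1
    H = np.zeros((L, D))          # hidden states, one row per layer
    Mem = np.zeros(D)             # the single spatiotemporal memory
    for X in X_seq:
        for l in range(L):
            inp = Wx @ X if l == 0 else np.zeros(D)  # X enters only at l = 1
            h_below = H[-1] if l == 0 else H[l - 1]  # H^L_{t-1} feeds l = 1
            g = np.tanh(inp + Wg[l] @ h_below)
            i = sigmoid(inp + Wi[l] @ h_below + Mem)
            f = sigmoid(inp + Wf[l] @ h_below + Mem)
            Mem = f * Mem + i * g                    # updated at every layer
            o = sigmoid(inp + Wo[l] @ h_below + Mem)
            H[l] = o * np.tanh(Mem)
    return H, Mem

rng = np.random.default_rng(1)
H, Mem = zigzag_forward(rng.standard_normal((5, 4)), L=3, D=8, rng=rng)
print(H.shape, Mem.shape)  # (3, 8) (8,)
```

Because the layers are updated in order within each step, `H[-1]` at l = 0 still holds the top layer's hidden state from the previous time step, reproducing the zigzag transition.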
Note that in the revised ConvLSTM network with a spatiotemporal memory in Figure 1, we replace the notation for the memory cell from C to M to emphasize that it flows in the zigzag direction instead of the horizontal direction.\n\n3.2 Spatiotemporal LSTM\n\nFigure 2: ST-LSTM (left) and PredRNN (right). The orange circles in the ST-LSTM unit denote the differences compared with the conventional ConvLSTM. The orange arrows in PredRNN denote the spatiotemporal memory flow, namely the transition path of the spatiotemporal memory M^l_t in the left.\n\nHowever, dropping the temporal flow in the horizontal direction is prone to sacrificing temporal coherency. In this section, we present the predictive recurrent neural network (PredRNN), by replacing convolutional LSTMs with a novel spatiotemporal long short-term memory (ST-LSTM) unit (see Figure 2). In the architecture presented in the previous subsection, the spatiotemporal memory cells are updated in a zigzag direction, and information is delivered first upwards across layers and then forwards over time. This enables efficient flow of spatial information, but is prone to vanishing gradients, since the memory needs to flow along a longer path between distant states. With the aid of ST-LSTMs, our PredRNN model in Figure 2 enables simultaneous flows of both the standard temporal memory and the proposed spatiotemporal memory.
The equations of ST-LSTM are shown as follows:\n\ng_t = tanh(W_{xg} ∗ X_t + W_{hg} ∗ H^l_{t−1} + b_g)\ni_t = σ(W_{xi} ∗ X_t + W_{hi} ∗ H^l_{t−1} + b_i)\nf_t = σ(W_{xf} ∗ X_t + W_{hf} ∗ H^l_{t−1} + b_f)\nC^l_t = f_t ⊙ C^l_{t−1} + i_t ⊙ g_t\ng′_t = tanh(W′_{xg} ∗ X_t + W_{mg} ∗ M^{l−1}_t + b′_g)\ni′_t = σ(W′_{xi} ∗ X_t + W_{mi} ∗ M^{l−1}_t + b′_i)\nf′_t = σ(W′_{xf} ∗ X_t + W_{mf} ∗ M^{l−1}_t + b′_f)\nM^l_t = f′_t ⊙ M^{l−1}_t + i′_t ⊙ g′_t\no_t = σ(W_{xo} ∗ X_t + W_{ho} ∗ H^l_{t−1} + W_{co} ∗ C^l_t + W_{mo} ∗ M^l_t + b_o)\nH^l_t = o_t ⊙ tanh(W_{1×1} ∗ [C^l_t, M^l_t]).   (4)\n\nTwo memory cells are maintained: C^l_t is the standard temporal cell that is delivered from the previous node at t−1 to the current time step within each LSTM unit. M^l_t is the spatiotemporal memory described in the current section, which is conveyed vertically from layer l−1 to the current node at the same time step. For the bottom ST-LSTM layer, where l = 1, M^{l−1}_t = M^L_{t−1}, as described in the previous subsection. We construct another set of gate structures for M^l_t, while maintaining the original gates for C^l_t as in standard LSTMs. At last, the final hidden states of this node rely on the fused spatiotemporal memory. We concatenate the memories derived from different directions and then apply a 1 × 1 convolution layer for dimension reduction, which makes the hidden state H^l_t of the same dimensions as the memory cells.
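A minimal sketch of one ST-LSTM step in Eq. (4) follows, again with the spatial grid collapsed to vectors and convolutions replaced by dense matrices. It is illustrative only; names such as `W['fuse']` are ours, standing in for the 1 × 1 fusion convolution W_{1×1}.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simplified sketch of an ST-LSTM step (Eq. 4). C is the standard temporal
# memory (from t-1, same layer); Mem is the spatiotemporal memory (from layer
# l-1, same time step). W['fuse'] plays the role of the 1x1 convolution that
# maps the concatenated [C, M] back to D channels.
def st_lstm_step(X, H_prev, C_prev, M_below, W):
    # temporal branch, gated by H^l_{t-1}
    g = np.tanh(W['xg'] @ X + W['hg'] @ H_prev)
    i = sigmoid(W['xi'] @ X + W['hi'] @ H_prev)
    f = sigmoid(W['xf'] @ X + W['hf'] @ H_prev)
    C = f * C_prev + i * g
    # spatiotemporal branch, gated by M^{l-1}_t
    g2 = np.tanh(W['xg2'] @ X + W['mg'] @ M_below)
    i2 = sigmoid(W['xi2'] @ X + W['mi'] @ M_below)
    f2 = sigmoid(W['xf2'] @ X + W['mf'] @ M_below)
    Mem = f2 * M_below + i2 * g2
    # a single shared output gate reads both memories, then fuses them
    o = sigmoid(W['xo'] @ X + W['ho'] @ H_prev + W['co'] @ C + W['mo'] @ Mem)
    H = o * np.tanh(W['fuse'] @ np.concatenate([C, Mem]))
    return H, C, Mem

P, D = 3, 8
rng = np.random.default_rng(2)
W = {k: rng.standard_normal((D, P)) * 0.1
     for k in ['xg', 'xi', 'xf', 'xo', 'xg2', 'xi2', 'xf2']}
W.update({k: rng.standard_normal((D, D)) * 0.1
          for k in ['hg', 'hi', 'hf', 'ho', 'mg', 'mi', 'mf', 'co', 'mo']})
W['fuse'] = rng.standard_normal((D, 2 * D)) * 0.1
H, C, Mem = st_lstm_step(rng.standard_normal(P), np.zeros(D), np.zeros(D),
                         np.zeros(D), W)
print(H.shape)  # (8,)
```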
Different from simple memory concatenation, the ST-LSTM unit uses a shared output gate for both memory types to enable seamless memory fusion, which can effectively model the shape deformations and motion trajectories in spatiotemporal sequences.\n\n4 Experiments\n\nOur model is demonstrated to achieve the state-of-the-art performance on three video prediction datasets including both synthetic and natural video sequences. Our PredRNN model is optimized with an L1 + L2 loss (other losses have been tried, but the L1 + L2 loss works best). All models are trained using the ADAM optimizer [11] with a starting learning rate of 10^{−3}. The training process is stopped after 80,000 iterations. Unless otherwise specified, the batch size of each iteration is set to 8. All experiments are implemented in TensorFlow [1] and conducted on NVIDIA TITAN-X GPUs.\n\n4.1 Moving MNIST dataset\n\nImplementation We generate Moving MNIST sequences with the method described in [21]. Each sequence consists of 20 consecutive frames, 10 for the input and 10 for the prediction. Each frame contains two or three handwritten digits bouncing inside a 64 × 64 grid of image. The digits were chosen randomly from the MNIST training set and placed initially at random locations. For each digit, we assign a velocity whose direction is chosen randomly from a uniform distribution on a unit circle, and whose amplitude is chosen randomly in [3, 5). The digits bounce off the edges of the image and occlude each other when reaching the same location.
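The bouncing dynamics described above can be sketched as follows. This is an illustrative sketch, not the dataset's generation code; the 28 × 28 patch size is an assumption based on standard MNIST digits.

```python
import numpy as np

# Sketch of the bouncing dynamics: each digit gets a velocity with direction
# uniform on the unit circle and amplitude in [3, 5), and reflects off the
# borders of a 64 x 64 canvas (digit patch assumed 28 x 28).
rng = np.random.default_rng(0)
canvas, patch, steps = 64, 28, 20
limit = canvas - patch                      # top-left corner ranges over [0, limit]

theta = rng.uniform(0, 2 * np.pi)
speed = rng.uniform(3, 5)
vel = speed * np.array([np.cos(theta), np.sin(theta)])
pos = rng.uniform(0, limit, size=2)

trajectory = []
for _ in range(steps):
    pos = pos + vel
    for axis in range(2):                   # reflect off each edge
        if pos[axis] < 0:
            pos[axis], vel[axis] = -pos[axis], -vel[axis]
        elif pos[axis] > limit:
            pos[axis], vel[axis] = 2 * limit - pos[axis], -vel[axis]
    trajectory.append(pos.copy())

trajectory = np.array(trajectory)
print(trajectory.shape)  # (20, 2)
```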
These properties make it hard for a model to give accurate predictions without learning the inner dynamics of the movement. With digits generated quickly on the fly, the training set is effectively of infinite size. The test set is fixed, consisting of 5,000 sequences. We sample digits from the MNIST test set, ensuring that the trained model has never seen them before. Also, the model trained with two digits is tested on another Moving MNIST dataset with three digits. Such a test setup is able to measure PredRNN's generalization and transfer ability, because no frames containing three digits are given throughout the training period. As a strong competitor, we include the latest state-of-the-art VPN model [10]. We find it hard to reproduce VPN's experimental results on Moving MNIST since it is not open source; thus we adopt its baseline version, which uses CNNs instead of PixelCNNs as its decoder and generates each frame in one pass. We observe that the total number of hidden states has a strong impact on the final accuracy of PredRNN. After a number of trials, we present a 4-layer architecture with 128 hidden states in each layer, which yields a high prediction accuracy using reasonable training time and memory footprint.\n\nTable 1: Results of PredRNN with spatiotemporal memory M, PredRNN with ST-LSTMs, and state-of-the-art models. We report per-frame MSE and cross-entropy (CE) of generated sequences averaged across the Moving MNIST test sets.
Lower MSE or CE denotes better prediction accuracy.\n\nModel | MNIST-2 (CE/frame) | MNIST-2 (MSE/frame) | MNIST-3 (MSE/frame)\nFC-LSTM [21] | 483.2 | 118.3 | 162.4\nConvLSTM (128 × 4) [19] | 367.0 | 103.3 | 142.1\nCDNA [6] | 346.6 | 97.4 | 138.2\nDFN [3] | 285.2 | 89.0 | 130.5\nVPN baseline [10] | 110.1 | 70.0 | 125.2\nPredRNN with spatiotemporal memory M | 118.5 | 74.0 | 118.2\nPredRNN + ST-LSTM (128 × 4) | 97.0 | 56.8 | 93.4\n\nResults As an ablation study, PredRNN with only the zigzag memory flow reduces the per-frame MSE to 74.0 on the Moving MNIST-2 test set (see Table 1). By replacing convolutional LSTMs with ST-LSTMs, we further reduce the sequence MSE from 74.0 down to 56.8. The corresponding frame-by-frame quantitative comparisons are presented in Figure 3. Compared with VPN, our model turns out to be more accurate for long-term predictions, especially on Moving MNIST-3. We also use per-frame cross-entropy likelihood as another evaluation metric on Moving MNIST-2. PredRNN with ST-LSTMs significantly outperforms all previous methods, while PredRNN with spatiotemporal memory M performs comparably with the VPN baseline.\nA qualitative comparison of predicted video sequences is given in Figure 4. Though VPN's generated frames look a bit sharper, its predictions gradually deviate from the correct trajectories, as illustrated in the first example. Moreover, for those sequences in which digits are overlapped and entangled, VPN has difficulties in separating the digits clearly while maintaining their individual shapes. For example, in the right figure, digit "8" loses its left-side pixels and is predicted as "3" after overlapping. Other baseline models suffer from a more severe blur effect, especially for longer future time steps.
By contrast, PredRNN's results are not only sharp enough but also more accurate for long-term motion predictions.\n\nFigure 3: Frame-wise MSE comparisons of different models on the Moving MNIST test sets: (a) MNIST-2; (b) MNIST-3.\n\nFigure 4: Prediction examples on the Moving MNIST-2 test set.\n\n4.2 KTH action dataset\n\nImplementation The KTH action dataset [18] contains six types of human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors, outdoors with scale variations, outdoors with different clothes, and indoors. All video clips were taken over homogeneous backgrounds with a static camera at a 25 fps frame rate and have a length of four seconds on average. To make the results comparable, we adopt the experimental setup in [25]: video frames are resized to 128 × 128 pixels and all videos are divided with respect to the subjects into a training set (persons 1-16) and a test set (persons 17-25). All models, including PredRNN as well as the baselines, are trained on the training set across all six action categories by generating the subsequent 10 frames from the last 10 observations, while the prediction results presented in Figure 5 and Figure 6 are obtained on the test set by predicting 20 time steps into the future. We sample sub-clips using a 20-frame-wide sliding window with a stride of 1 on the training set. For evaluation, we broaden the sliding window to 30 frames and set the stride to 3 for running and jogging, and to 20 for the other categories. Sub-clips for running, jogging, and walking are manually trimmed to ensure that humans are always present in the frame sequences.
In the end, we split the database into a training set of 108,717 sequences and a test set of 4,086 sequences.\n\nResults We use the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM) [26] as metrics to evaluate the prediction results and provide frame-wise quantitative comparisons in Figure 5. A higher value denotes a better prediction performance. The value of SSIM ranges between -1 and 1, and a larger score means a greater similarity between two images. PredRNN consistently outperforms the comparison models. Specifically, the Predictive Coding Network [12] always exploits the whole ground truth sequence before the current time step to predict the next frame. Thus, it cannot make sequence predictions. Here, we make it predict the next 20 frames by feeding the 10 ground truth frames and the recursively generated frames from all previous time steps. The performance of MCnet [25] deteriorates quickly for long-term predictions. The residual connections of MCnet convey the CNN features of the last frame to the decoder and ignore the previous frames, which emphasizes spatial appearances while weakening temporal variations. By contrast, the results of PredRNN in both metrics remain stable over time, with only a slow and reasonable decline. Figure 6 visualizes a sample video sequence from the KTH test set. The ConvLSTM network [19] generates blurred future frames, since it fails to memorize the detailed spatial representations. MCnet [25] produces sharper images but is not able to forecast the movement trajectory accurately. Thanks to ST-LSTMs, PredRNN memorizes detailed visual appearances as well as long-term motions.
It outperforms all baseline models and shows superior predicting power both spatially and temporally.\n\nFigure 5: Frame-wise PSNR (a) and SSIM (b) comparisons of different models on the KTH action test set. A higher score denotes a better prediction accuracy.\n\nFigure 6: KTH prediction samples. We predict 20 frames into the future by observing 10 frames.\n\n4.3 Radar echo dataset\n\nPredicting the shape and movement of future radar echoes is a real application of predictive learning and is the foundation of precipitation nowcasting. It is a more challenging task because radar echoes are not rigid. Also, their speeds are not as fixed as those of moving digits, their trajectories are not as periodic as KTH actions, and their shapes may accumulate, dissipate or change rapidly due to the complex atmospheric environment. Modeling spatial deformation is significant for the prediction of this data.\n\nImplementation We first collect the radar echo dataset by adapting the data handling method described in [19]. Our dataset consists of 10,000 consecutive radar observations, recorded every 6 minutes in Guangzhou, China. For preprocessing, we first map the radar intensities to pixel values, and represent them as 100 × 100 gray-scale images. Then we slice the consecutive images with a 20-frame-wide sliding window. Thus, each sequence consists of 20 frames, 10 for the input and 10 for the forecasting. The total 9,600 sequences are split into a training set of 7,800 samples and a test set of 1,800 samples.
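The sliding-window slicing described above can be sketched as follows. This is illustrative only; the toy frame count and the helper name `slice_sequences` are ours, not the dataset's exact preprocessing code.

```python
import numpy as np

# Sketch of the preprocessing: consecutive frames are sliced by a
# 20-frame-wide sliding window into 10-frame input / 10-frame target pairs.
def slice_sequences(frames, window=20, stride=1, context=10):
    seqs = [frames[i:i + window]
            for i in range(0, len(frames) - window + 1, stride)]
    seqs = np.stack(seqs)
    return seqs[:, :context], seqs[:, context:]   # inputs, targets

frames = np.random.rand(100, 100, 100).astype(np.float32)  # toy stand-in
inputs, targets = slice_sequences(frames)
print(inputs.shape, targets.shape)  # (81, 10, 100, 100) (81, 10, 100, 100)
```

With 100 toy frames and stride 1 there are 100 − 20 + 1 = 81 windows; the real dataset's 10,000 observations are sliced and then split into the 7,800/1,800 training and test sets reported above.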
The PredRNN model consists of two ST-LSTM layers with 128 hidden states each. The convolution filters inside the ST-LSTMs are set to 3 × 3. After prediction, we transform the resulting echo intensities into colored radar maps, as shown in Figure 7, and then calculate the amount of precipitation at each grid cell of these radar maps using Z-R relationships. Since this conversion would bring an additional systematic error into rainfall prediction and make the final results misleading, we do not take precipitation amounts into account in this paper, but only compare the predicted echo intensity with the ground truth.\n\nResults Two baseline models are considered. The ConvLSTM network [19] is the first architecture that models sequential radar maps with convolutional LSTMs, but its predictions tend to be blurry and obviously inaccurate (see Figure 7). As a strong competitor, we also include the latest state-of-the-art VPN model [10]. The PixelCNN-based VPN predicts an image pixel by pixel recursively, which takes around 15 minutes to generate a radar map. Given that precipitation nowcasting has a high demand for real-time computing, we trade off prediction accuracy against computational efficiency and adopt VPN's baseline model, which uses CNNs as its decoders and generates each frame in one pass. Table 2 shows that the prediction error of PredRNN is significantly lower than that of the VPN baseline. Though VPN generates more accurate radar maps for the near future, it suffers from a rapid decay over the long term. Such a phenomenon results from a lack of strong LSTM layers to model spatiotemporal variations.
Furthermore, PredRNN requires only about 1/5 of the memory and training time of the VPN baseline.

Table 2: Quantitative results of different methods on the radar echo dataset.

Model               MSE/frame   Training time/100 batches   Memory usage
ConvLSTM [19]       68.0        105 s                       1756 MB
VPN baseline [10]   60.7        539 s                       11513 MB
PredRNN             44.2        117 s                       2367 MB

Figure 7: A prediction example on the radar echo test set.

5 Conclusions

In this paper, we propose a novel end-to-end recurrent network named PredRNN for spatiotemporal predictive learning that models spatial deformations and temporal variations simultaneously. Its memory states zigzag across stacked LSTM layers vertically and through all time states horizontally. Furthermore, we introduce a new spatiotemporal LSTM (ST-LSTM) unit with a gate-controlled dual memory structure as the key building block of PredRNN. Our model achieves state-of-the-art performance on three video prediction datasets including both synthetic and natural video sequences.

Acknowledgments

This work was supported by the National Key R&D Program of China (2016YFB1000701), the National Natural Science Foundation of China (61772299, 61325008, 61502265, 61672313) and the TNList Fund.

References
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[2] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
[3] B. De Brabandere, X. Jia, T. Tuytelaars, and L. Van Gool. Dynamic filter networks. In NIPS, 2016.
[4] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks.
In NIPS, pages 1486–1494, 2015.
[5] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, pages 2625–2634, 2015.
[6] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016.
[7] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. In NIPS, pages 2672–2680, 2014.
[8] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In ICML, pages 1764–1772, 2014.
[9] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[10] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. In ICML, 2017.
[11] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[12] W. Lotter, G. Kreiman, and D. Cox. Deep predictive coding networks for video prediction and unsupervised learning. In ICLR, 2017.
[13] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
[14] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in Atari games. In NIPS, pages 2863–2871, 2015.
[15] V. Patraucean, A. Handa, and R. Cipolla. Spatio-temporal video autoencoder with differentiable memory. In ICLR Workshop, 2016.
[16] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
[17] D. E. Rumelhart, G. E. Hinton, and R. J. Williams.
Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.
[18] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In International Conference on Pattern Recognition, pages 32–36, 2004.
[19] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, pages 802–810, 2015.
[20] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.
[21] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
[22] I. Sutskever, J. Martens, and G. E. Hinton. Generating text with recurrent neural networks. In ICML, pages 1017–1024, 2011.
[23] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112, 2014.
[24] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with PixelCNN decoders. In NIPS, pages 4790–4798, 2016.
[25] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence prediction. In ICLR, 2017.
[26] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 13(4):600, 2004.
[27] P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
[28] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification.
In CVPR, pages 4694–4702, 2015.