{"title": "Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting", "book": "Advances in Neural Information Processing Systems", "page_first": 802, "page_last": 810, "abstract": "The goal of precipitation nowcasting is to predict the future rainfall intensity in a local region over a relatively short period of time. Very few previous studies have examined this crucial and challenging weather forecasting problem from the machine learning perspective. In this paper, we formulate precipitation nowcasting as a spatiotemporal sequence forecasting problem in which both the input and the prediction target are spatiotemporal sequences. By extending the fully connected LSTM (FC-LSTM) to have convolutional structures in both the input-to-state and state-to-state transitions, we propose the convolutional LSTM (ConvLSTM) and use it to build an end-to-end trainable model for the precipitation nowcasting problem. Experiments show that our ConvLSTM network captures spatiotemporal correlations better and consistently outperforms FC-LSTM and the state-of-the-art operational ROVER algorithm for precipitation nowcasting.", "full_text": "Convolutional LSTM Network: A Machine Learning\n\nApproach for Precipitation Nowcasting\n\nXingjian Shi Zhourong Chen Hao Wang Dit-Yan Yeung\n\nDepartment of Computer Science and Engineering\nHong Kong University of Science and Technology\n\n{xshiab,zchenbb,hwangaz,dyyeung}@cse.ust.hk\n\nWai-kin Wong Wang-chun Woo\n\nHong Kong Observatory\n\nHong Kong, China\n\n{wkwong,wcwoo}@hko.gov.hk\n\nAbstract\n\nThe goal of precipitation nowcasting is to predict the future rainfall intensity in a\nlocal region over a relatively short period of time. Very few previous studies have\nexamined this crucial and challenging weather forecasting problem from the ma-\nchine learning perspective. 
In this paper, we formulate precipitation nowcasting\nas a spatiotemporal sequence forecasting problem in which both the input and the\nprediction target are spatiotemporal sequences. By extending the fully connected\nLSTM (FC-LSTM) to have convolutional structures in both the input-to-state and\nstate-to-state transitions, we propose the convolutional LSTM (ConvLSTM) and\nuse it to build an end-to-end trainable model for the precipitation nowcasting prob-\nlem. Experiments show that our ConvLSTM network captures spatiotemporal\ncorrelations better and consistently outperforms FC-LSTM and the state-of-the-\nart operational ROVER algorithm for precipitation nowcasting.\n\n1\n\nIntroduction\n\nNowcasting convective precipitation has long been an important problem in the \ufb01eld of weather\nforecasting. The goal of this task is to give precise and timely prediction of rainfall intensity in a\nlocal region over a relatively short period of time (e.g., 0-6 hours). It is essential for taking such\ntimely actions as generating society-level emergency rainfall alerts, producing weather guidance for\nairports, and seamless integration with a longer-term numerical weather prediction (NWP) model.\nSince the forecasting resolution and time accuracy required are much higher than other traditional\nforecasting tasks like weekly average temperature prediction, the precipitation nowcasting problem\nis quite challenging and has emerged as a hot research topic in the meteorology community [22].\nExisting methods for precipitation nowcasting can roughly be categorized into two classes [22],\nnamely, NWP based methods and radar echo1 extrapolation based methods. For the NWP approach,\nmaking predictions at the nowcasting timescale requires a complex and meticulous simulation of\nthe physical equations in the atmosphere model. 
Thus the current state-of-the-art operational pre-\ncipitation nowcasting systems [19, 6] often adopt the faster and more accurate extrapolation based\nmethods. Speci\ufb01cally, some computer vision techniques, especially optical \ufb02ow based methods,\nhave proven useful for making accurate extrapolation of radar maps [10, 6, 20]. One recent progress\nalong this path is the Real-time Optical \ufb02ow by Variational methods for Echoes of Radar (ROVER)\n\n1In real-life systems, radar echo maps are often constant altitude plan position indicator (CAPPI) images [9].\n\n1\n\n\falgorithm [25] proposed by the Hong Kong Observatory (HKO) for its Short-range Warning of\nIntense Rainstorms in Localized System (SWIRLS) [15]. ROVER calculates the optical \ufb02ow of\nconsecutive radar maps using the algorithm in [5] and performs semi-Lagrangian advection [4] on\nthe \ufb02ow \ufb01eld, which is assumed to be still, to accomplish the prediction. However, the success of\nthese optical \ufb02ow based methods is limited because the \ufb02ow estimation step and the radar echo ex-\ntrapolation step are separated and it is challenging to determine the model parameters to give good\nprediction performance.\nThese technical issues may be addressed by viewing the problem from the machine learning per-\nspective.\nIn essence, precipitation nowcasting is a spatiotemporal sequence forecasting problem\nwith the sequence of past radar maps as input and the sequence of a \ufb01xed number (usually larger\nthan 1) of future radar maps as output.2 However, such learning problems, regardless of their exact\napplications, are nontrivial in the \ufb01rst place due to the high dimensionality of the spatiotemporal\nsequences especially when multi-step predictions have to be made, unless the spatiotemporal struc-\nture of the data is captured well by the prediction model. 
Moreover, building an effective prediction\nmodel for the radar echo data is even more challenging due to the chaotic nature of the atmosphere.\nRecent advances in deep learning, especially recurrent neural network (RNN) and long short-term\nmemory (LSTM) models [12, 11, 7, 8, 23, 13, 18, 21, 26], provide some useful insights on how\nto tackle this problem. According to the philosophy underlying the deep learning approach, if we\nhave a reasonable end-to-end model and suf\ufb01cient data for training it, we are close to solving the\nproblem. The precipitation nowcasting problem satis\ufb01es the data requirement because it is easy\nto collect a huge amount of radar echo data continuously. What is needed is a suitable model for\nend-to-end learning. The pioneering LSTM encoder-decoder framework proposed in [23] provides a\ngeneral framework for sequence-to-sequence learning problems by training temporally concatenated\nLSTMs, one for the input sequence and another for the output sequence. In [18], it is shown that\nprediction of the next video frame and interpolation of intermediate frames can be done by building\nan RNN based language model on the visual words obtained by quantizing the image patches. They\npropose a recurrent convolutional neural network to model the spatial relationships but the model\nonly predicts one frame ahead and the size of the convolutional kernel used for state-to-state tran-\nsition is restricted to 1. Their work is followed up later in [21] which points out the importance\nof multi-step prediction in learning useful representations. They build an LSTM encoder-decoder-\npredictor model which reconstructs the input sequence and predicts the future sequence simultane-\nously. 
Although their method can also be used to solve our spatiotemporal sequence forecasting\nproblem, the fully connected LSTM (FC-LSTM) layer adopted by their model does not take spatial\ncorrelation into consideration.\nIn this paper, we propose a novel convolutional LSTM (ConvLSTM) network for precipitation now-\ncasting. We formulate precipitation nowcasting as a spatiotemporal sequence forecasting problem\nthat can be solved under the general sequence-to-sequence learning framework proposed in [23]. In\norder to model well the spatiotemporal relationships, we extend the idea of FC-LSTM to ConvLSTM\nwhich has convolutional structures in both the input-to-state and state-to-state transitions. By stack-\ning multiple ConvLSTM layers and forming an encoding-forecasting structure, we can build an\nend-to-end trainable model for precipitation nowcasting. For evaluation, we have created a new\nreal-life radar echo dataset which can facilitate further research especially on devising machine\nlearning algorithms for the problem. When evaluated on a synthetic Moving-MNIST dataset [21]\nand the radar echo dataset, our ConvLSTM model consistently outperforms both the FC-LSTM and\nthe state-of-the-art operational ROVER algorithm.\n\n2 Preliminaries\n\n2.1 Formulation of Precipitation Nowcasting Problem\n\nThe goal of precipitation nowcasting is to use the previously observed radar echo sequence to fore-\ncast a \ufb01xed length of the future radar maps in a local region (e.g., Hong Kong, New York, or Tokyo).\nIn real applications, the radar maps are usually taken from the weather radar every 6-10 minutes and\nnowcasting is done for the following 1-6 hours, i.e., to predict the 6-60 frames ahead. 
From the ma-\n\n2It is worth noting that our precipitation nowcasting problem is different from the one studied in [14], which\n\naims at predicting only the central region of just the next frame.\n\n2\n\n\fchine learning perspective, this problem can be regarded as a spatiotemporal sequence forecasting\nproblem.\nSuppose we observe a dynamical system over a spatial region represented by an M \u00d7 N grid which\nconsists of M rows and N columns. Inside each cell in the grid, there are P measurements which\nvary over time. Thus, the observation at any time can be represented by a tensor X \u2208 RP\u00d7M\u00d7N ,\nwhere R denotes the domain of the observed features. If we record the observations periodically, we\nwill get a sequence of tensors \u02c6X1, \u02c6X2, . . . , \u02c6Xt. The spatiotemporal sequence forecasting problem is\nto predict the most likely length-K sequence in the future given the previous J observations which\ninclude the current one:\n\n\u02dcXt+1, . . . , \u02dcXt+K = arg max\nXt+1,...,Xt+K\n\np(Xt+1, . . . ,Xt+K | \u02c6Xt\u2212J+1, \u02c6Xt\u2212J+2, . . . , \u02c6Xt)\n\n(1)\n\nFor precipitation nowcasting, the observation at every timestamp is a 2D radar echo map. If we\ndivide the map into tiled non-overlapping patches and view the pixels inside a patch as its measure-\nments (see Fig. 1), the nowcasting problem naturally becomes a spatiotemporal sequence forecasting\nproblem.\nWe note that our spatiotemporal sequence forecasting problem is different from the one-step time\nseries forecasting problem because the prediction target of our problem is a sequence which contains\nboth spatial and temporal structures. 
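To make the formulation concrete, the transformation of a 2D map into a P x M x N tensor by tiling non-overlapping patches (cf. Fig. 1) can be sketched in NumPy. The function name and shapes below are our own illustration, not code from the paper:

```python
import numpy as np

def to_patch_tensor(frame, patch):
    """Tile a 2D (H, W) frame into non-overlapping patch x patch blocks,
    giving a (patch*patch, H//patch, W//patch) tensor: each grid cell
    holds the P = patch*patch pixel measurements of one patch."""
    h, w = frame.shape
    assert h % patch == 0 and w % patch == 0
    m, n = h // patch, w // patch
    t = frame.reshape(m, patch, n, patch)        # split rows and columns
    return t.transpose(1, 3, 0, 2).reshape(patch * patch, m, n)

# e.g. a 64 x 64 frame with patch size 4 becomes a 16 x 16 x 16 tensor
x = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)
print(to_patch_tensor(x, 4).shape)  # (16, 16, 16)
```

Stacking such tensors over time yields the sequence X1, X2, ..., Xt used in the formulation above.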
Although the number of free variables in a length-K sequence\ncan be up to O(M KN KP K), in practice we may exploit the structure of the space of possible\npredictions to reduce the dimensionality and hence make the problem tractable.\n\n2.2 Long Short-Term Memory for Sequence Modeling\n\nFor general-purpose sequence modeling, LSTM as a special RNN structure has proven stable and\npowerful for modeling long-range dependencies in various previous studies [12, 11, 17, 23]. The\nmajor innovation of LSTM is its memory cell ct which essentially acts as an accumulator of the\nstate information. The cell is accessed, written and cleared by several self-parameterized controlling\ngates. Every time a new input comes, its information will be accumulated to the cell if the input gate\nit is activated. Also, the past cell status ct\u22121 could be \u201cforgotten\u201d in this process if the forget gate\nft is on. Whether the latest cell output ct will be propagated to the \ufb01nal state ht is further controlled\nby the output gate ot. One advantage of using the memory cell and gates to control information \ufb02ow\nis that the gradient will be trapped in the cell (also known as constant error carousels [12]) and be\nprevented from vanishing too quickly, which is a critical problem for the vanilla RNN model [12,\n17, 2]. FC-LSTM may be seen as a multivariate version of LSTM where the input, cell output and\nstates are all 1D vectors. In this paper, we follow the formulation of FC-LSTM as in [11]. 
The key\nequations are shown in (2) below, where \u2018\u25e6\u2019 denotes the Hadamard product:\n\nit = \u03c3(Wxixt + Whiht\u22121 + Wci \u25e6 ct\u22121 + bi)\nft = \u03c3(Wxf xt + Whf ht\u22121 + Wcf \u25e6 ct\u22121 + bf )\nct = ft \u25e6 ct\u22121 + it \u25e6 tanh(Wxcxt + Whcht\u22121 + bc)\not = \u03c3(Wxoxt + Whoht\u22121 + Wco \u25e6 ct + bo)\nht = ot \u25e6 tanh(ct)\n\n(2)\n\nMultiple LSTMs can be stacked and temporally concatenated to form more complex structures.\nSuch models have been applied to solve many real-life sequence modeling problems [23, 26].\n\n3 The Model\n\nWe now present our ConvLSTM network. Although the FC-LSTM layer has proven powerful for\nhandling temporal correlation, it contains too much redundancy for spatial data. To address this\nproblem, we propose an extension of FC-LSTM which has convolutional structures in both the\ninput-to-state and state-to-state transitions. By stacking multiple ConvLSTM layers and forming an\nencoding-forecasting structure, we are able to build a network model not only for the precipitation\nnowcasting problem but also for more general spatiotemporal sequence forecasting problems.\n\n3\n\n\fFigure 1: Transforming 2D image\ninto 3D tensor\n\n3.1 Convolutional LSTM\n\nFigure 2: Inner structure of ConvLSTM\n\nThe major drawback of FC-LSTM in handling spatiotemporal data is its usage of full connections in\ninput-to-state and state-to-state transitions in which no spatial information is encoded. To overcome\nthis problem, a distinguishing feature of our design is that all the inputs X1, . . . ,Xt, cell outputs\nC1, . . . ,Ct, hidden states H1, . . . ,Ht, and gates it, ft, ot of the ConvLSTM are 3D tensors whose\nlast two dimensions are spatial dimensions (rows and columns). To get a better picture of the inputs\nand states, we may imagine them as vectors standing on a spatial grid. 
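As a minimal NumPy sketch of the FC-LSTM update in (2): the gates use dense (fully connected) products, and the Wci, Wcf, Wco "peephole" terms act elementwise on the cell state. The dictionary keys and function name are our own naming, not the paper's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fc_lstm_step(x, h_prev, c_prev, W, b):
    """One FC-LSTM step following (2): dense input-to-state and
    state-to-state products, Hadamard peephole terms on the cell."""
    i = sigmoid(W['xi'] @ x + W['hi'] @ h_prev + W['ci'] * c_prev + b['i'])
    f = sigmoid(W['xf'] @ x + W['hf'] @ h_prev + W['cf'] * c_prev + b['f'])
    c = f * c_prev + i * np.tanh(W['xc'] @ x + W['hc'] @ h_prev + b['c'])
    o = sigmoid(W['xo'] @ x + W['ho'] @ h_prev + W['co'] * c + b['o'])
    h = o * np.tanh(c)
    return h, c
```

Because the states here are flat vectors, any spatial layout of the input is discarded, which is exactly the limitation the ConvLSTM below addresses.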
The ConvLSTM determines\nthe future state of a certain cell in the grid by the inputs and past states of its local neighbors.\nThis can easily be achieved by using a convolution operator in the state-to-state and input-to-state\ntransitions (see Fig. 2). The key equations of ConvLSTM are shown in (3) below, where \u2018\u2217\u2019 denotes\nthe convolution operator and \u2018\u25e6\u2019, as before, denotes the Hadamard product:\nit = \u03c3(Wxi \u2217 Xt + Whi \u2217 Ht\u22121 + Wci \u25e6 Ct\u22121 + bi)\nft = \u03c3(Wxf \u2217 Xt + Whf \u2217 Ht\u22121 + Wcf \u25e6 Ct\u22121 + bf )\nCt = ft \u25e6 Ct\u22121 + it \u25e6 tanh(Wxc \u2217 Xt + Whc \u2217 Ht\u22121 + bc)\not = \u03c3(Wxo \u2217 Xt + Who \u2217 Ht\u22121 + Wco \u25e6 Ct + bo)\nHt = ot \u25e6 tanh(Ct)\n\nIf we view the states as the hidden representations of moving objects, a ConvLSTM with a larger\ntransitional kernel should be able to capture faster motions while one with a smaller kernel can\ncapture slower motions. Also, if we adopt a similar view as [16], the inputs, cell outputs and hidden\nstates of the traditional FC-LSTM represented by (2) may also be seen as 3D tensors with the last\ntwo dimensions being 1. In this sense, FC-LSTM is actually a special case of ConvLSTM with all\nfeatures standing on a single cell.\nTo ensure that the states have the same number of rows and same number of columns as the inputs,\npadding is needed before applying the convolution operation. Here, padding of the hidden states on\nthe boundary points can be viewed as using the state of the outside world for calculation. Usually,\nbefore the \ufb01rst input comes, we initialize all the states of the LSTM to zero which corresponds to\n\u201ctotal ignorance\u201d of the future. Similarly, if we perform zero-padding (which is used in this paper)\non the hidden states, we are actually setting the state of the outside world to zero and assume no prior\nknowledge about the outside. 
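A minimal NumPy sketch of one ConvLSTM step following (3), with "same" zero padding so the states keep the input's spatial size (i.e., the "outside world" state is taken to be zero). Function and key names are ours, and cross-correlation stands in for the convolution:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_same(W, X):
    """'Same' 2D cross-correlation: W is (out_ch, in_ch, k, k) with k odd,
    X is (in_ch, M, N). Zero padding keeps the M x N spatial size."""
    k = W.shape[-1]
    p = k // 2
    Xp = np.pad(X, ((0, 0), (p, p), (p, p)))
    win = sliding_window_view(Xp, (k, k), axis=(1, 2))  # (in_ch, M, N, k, k)
    return np.einsum('oikl,imnkl->omn', W, win)

def convlstm_step(X, H_prev, C_prev, W, b):
    """One ConvLSTM step following (3): the dense products of FC-LSTM are
    replaced by convolutions; the cell terms stay Hadamard products."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    i = sig(conv_same(W['xi'], X) + conv_same(W['hi'], H_prev) + W['ci'] * C_prev + b['i'])
    f = sig(conv_same(W['xf'], X) + conv_same(W['hf'], H_prev) + W['cf'] * C_prev + b['f'])
    C = f * C_prev + i * np.tanh(conv_same(W['xc'], X) + conv_same(W['hc'], H_prev) + b['c'])
    o = sig(conv_same(W['xo'], X) + conv_same(W['ho'], H_prev) + W['co'] * C + b['o'])
    H = o * np.tanh(C)
    return H, C
```

Note how each cell's next state depends only on the k x k neighborhood of its input and previous state, which is the locality property discussed above.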
By padding on the states, we can treat the boundary points differently,\nwhich is helpful in many cases. For example, imagine that the system we are observing is a moving\nball surrounded by walls. Although we cannot see these walls, we can infer their existence by \ufb01nding\nthe ball bouncing over them again and again, which can hardly be done if the boundary points have\nthe same state transition dynamics as the inner points.\n\n3.2 Encoding-Forecasting Structure\n\nLike FC-LSTM, ConvLSTM can also be adopted as a building block for more complex structures.\nFor our spatiotemporal sequence forecasting problem, we use the structure shown in Fig. 3, which\nconsists of two networks, an encoding network and a forecasting network. As in [21], the initial\nstates and cell outputs of the forecasting network are copied from the last state of the encoding\nnetwork. Both networks are formed by stacking several ConvLSTM layers. As our prediction target\nhas the same dimensionality as the input, we concatenate all the states in the forecasting network\nand feed them into a 1 \u00d7 1 convolutional layer to generate the \ufb01nal prediction.\nWe can interpret this structure from a viewpoint similar to that of [23]. The encoding LSTM compresses\nthe whole input sequence into a hidden state tensor and the forecasting LSTM unfolds this hidden\n\n4\n\n\fFigure 3: Encoding-forecasting ConvLSTM network for precipitation nowcasting\n\nstate to give the \ufb01nal prediction:\n\u02dcXt+1, . . . , \u02dcXt+K = arg max_{Xt+1,...,Xt+K} p(Xt+1, . . . , Xt+K | \u02c6Xt\u2212J+1, \u02c6Xt\u2212J+2, . . . , \u02c6Xt)\n\u2248 arg max_{Xt+1,...,Xt+K} p(Xt+1, . . . , Xt+K | fencoding( \u02c6Xt\u2212J+1, \u02c6Xt\u2212J+2, . . . , \u02c6Xt))\n\u2248 gforecasting(fencoding( \u02c6Xt\u2212J+1, \u02c6Xt\u2212J+2, . . . , \u02c6Xt)) (4)\n\nThis structure is also similar to the LSTM future predictor model in [21] except that our input and\noutput elements are all 3D tensors which preserve all the spatial information. Since the network has\nmultiple stacked ConvLSTM layers, it has strong representational power, which makes it suitable\nfor giving predictions in complex dynamical systems like the precipitation nowcasting problem we\nstudy here.\n\n4 Experiments\n\nWe \ufb01rst compare our ConvLSTM network with the FC-LSTM network on a synthetic Moving-\nMNIST dataset to gain some basic understanding of the behavior of our model. We run our model\nwith different numbers of layers and kernel sizes and also study some \u201cout-of-domain\u201d cases as\nin [21]. To verify the effectiveness of our model on the more challenging precipitation nowcasting\nproblem, we build a new radar echo dataset and compare our model with the state-of-the-art ROVER\nalgorithm based on several commonly used precipitation nowcasting metrics. The results of the\nexperiments conducted on these two datasets lead to the following \ufb01ndings:\n\n\u2022 ConvLSTM is better than FC-LSTM in handling spatiotemporal correlations.\n\u2022 Making the size of the state-to-state convolutional kernel bigger than 1 is essential for capturing the spatiotemporal motion patterns.\n\u2022 Deeper models can produce better results with fewer parameters.\n\u2022 ConvLSTM performs better than ROVER for precipitation nowcasting.\n\nOur implementations of the models are in Python with the help of Theano [3, 1]. We run all the\nexperiments on a computer with a single NVIDIA K20 GPU. Also, more illustrative \u201cgif\u201d examples\nare included in the appendix.\n\n4.1 Moving-MNIST Dataset\n\nFor this synthetic dataset, we use a generation process similar to that described in [21].
All data\ninstances in the dataset are 20 frames long (10 frames for the input and 10 frames for the prediction)\nand contain two handwritten digits bouncing inside a 64 \u00d7 64 patch. The moving digits are\nchosen randomly from a subset of 500 digits in the MNIST dataset.3 The starting position and velocity\ndirection are chosen uniformly at random and the velocity amplitude is chosen randomly in\n[3, 5). This generation process is repeated 15000 times, resulting in a dataset with 10000 training\nsequences, 2000 validation sequences, and 3000 testing sequences. We train all the LSTM models\nby minimizing the cross-entropy loss4 using back-propagation through time (BPTT) [2] and\n\n3MNIST dataset: http://yann.lecun.com/exdb/mnist/\n4The cross-entropy loss of the predicted frame P and the ground-truth frame T is de\ufb01ned as\n\u2212\u2211i,j,k [Ti,j,k log Pi,j,k + (1 \u2212 Ti,j,k) log(1 \u2212 Pi,j,k)].\n\n5\n\n\fTable 1: Comparison of ConvLSTM networks with the FC-LSTM network on the Moving-MNIST\ndataset. \u2018-5x5\u2019 and \u2018-1x1\u2019 represent the corresponding state-to-state kernel size, which is either 5\u00d75\nor 1\u00d71. \u2018256\u2019, \u2018128\u2019, and \u201864\u2019 refer to the number of hidden states in the ConvLSTM layers. \u2018(5x5)\u2019\nand \u2018(9x9)\u2019 represent the input-to-state kernel size.\n\nModel                                  Number of parameters  Cross entropy\nFC-LSTM-2048-2048                      142,667,776           4832.49\nConvLSTM(5x5)-5x5-256                  13,524,496            3887.94\nConvLSTM(5x5)-5x5-128-5x5-128          10,042,896            3733.56\nConvLSTM(5x5)-5x5-128-5x5-64-5x5-64    7,585,296             3670.85\nConvLSTM(9x9)-1x1-128-1x1-128          11,550,224            4782.84\nConvLSTM(9x9)-1x1-128-1x1-64-1x1-64    8,830,480             4231.50\n\nFigure 4: An example showing an \u201cout-of-domain\u201d run. 
From left to right: input frames; ground\ntruth; prediction by the 3-layer network.\n\nRMSProp [24] with a learning rate of 10\u22123 and a decay rate of 0.9. Also, we perform early-stopping\non the validation set.\nDespite the simple generation process, there exist strong nonlinearities in the resulting dataset be-\ncause the moving digits can exhibit complicated appearance and will occlude and bounce during\ntheir movement. It is hard for a model to give accurate predictions on the test set without learning\nthe inner dynamics of the system.\nFor the FC-LSTM network, we use the same structure as the unconditional future predictor model\nin [21] with two 2048-node LSTM layers. For our ConvLSTM network, we set the patch size to\n4 \u00d7 4 so that each 64 \u00d7 64 frame is represented by a 16 \u00d7 16 \u00d7 16 tensor. We test three variants of\nour model with different number of layers. The 1-layer network contains one ConvLSTM layer with\n256 hidden states, the 2-layer network has two ConvLSTM layers with 128 hidden states each, and\nthe 3-layer network has 128, 64, and 64 hidden states respectively in the three ConvLSTM layers.\nAll the input-to-state and state-to-state kernels are of size 5 \u00d7 5. Our experiments show that the\nConvLSTM networks perform consistently better than the FC-LSTM network. Also, deeper models\ncan give better results although the improvement is not so signi\ufb01cant between the 2-layer and 3-layer\nnetworks. Moreover, we also try other network con\ufb01gurations with the state-to-state and input-to-\nstate kernels of the 2-layer and 3-layer networks changed to 1 \u00d7 1 and 9 \u00d7 9, respectively. Although\nthe number of parameters of the new 2-layer network is close to the original one, the result becomes\nmuch worse because it is hard to capture the spatiotemporal motion patterns with only 1\u00d7 1 state-to-\nstate transition. 
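The Moving-MNIST generation process described earlier (random start, direction, and speed amplitude in [3, 5); digits bounce off the frame walls) might be sketched as follows. The stand-in digit images and function name are our own illustration:

```python
import numpy as np

def moving_mnist_sequence(digits, length=20, size=64, rng=None):
    """Sketch of the bouncing-digit generation: each 28 x 28 digit gets a
    random start, direction, and speed in [3, 5), and bounces off the
    size x size frame walls. `digits` stands in for real MNIST images."""
    rng = rng if rng is not None else np.random.default_rng()
    d = 28
    pos = rng.uniform(0, size - d, size=(len(digits), 2))
    theta = rng.uniform(0, 2 * np.pi, size=len(digits))
    speed = rng.uniform(3, 5, size=len(digits))
    vel = speed[:, None] * np.stack([np.cos(theta), np.sin(theta)], axis=1)
    frames = np.zeros((length, size, size), dtype=np.float32)
    for t in range(length):
        for k, img in enumerate(digits):
            r, c = pos[k].astype(int)
            frames[t, r:r + d, c:c + d] = np.maximum(frames[t, r:r + d, c:c + d], img)
        pos += vel
        for k in range(len(digits)):
            for a in range(2):  # bounce: reflect a coordinate leaving [0, size - d]
                if pos[k, a] < 0:
                    pos[k, a] = -pos[k, a]; vel[k, a] = -vel[k, a]
                elif pos[k, a] > size - d:
                    pos[k, a] = 2 * (size - d) - pos[k, a]; vel[k, a] = -vel[k, a]
    return frames
```

The occlusions produced when digits overlap (resolved here with a pixelwise maximum) are one source of the nonlinearity discussed above.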
Meanwhile, the new 3-layer network performs better than the new 2-layer network\nsince the higher layer can see a wider scope of the input. Nevertheless, its performance is inferior\nto networks with larger state-to-state kernel size. This provides evidence that larger state-to-state\nkernels are more suitable for capturing spatiotemporal correlations. In fact, for 1 \u00d7 1 kernel, the\nreceptive \ufb01eld of the states will not grow as time advances. But for larger kernels, later states have\nlarger receptive \ufb01elds and are related to a wider range of the input. The average cross-entropy loss\n(cross-entropy loss per sequence) of each algorithm on the test set is shown in Table 1. We need\nto point out that our experiment setting is different from [21] where an in\ufb01nite number of training\ndata is assumed to be available. The current of\ufb02ine setting is chosen in order to understand how\ndifferent models perform in occasions where not so much data is available. Comparison of the\n3-layer ConvLSTM and FC-LSTM in the online setting is included in the appendix.\n\n6\n\n\fNext, we test our model on some \u201cout-of-domain\u201d inputs. We generate another 3000 sequences of\nthree moving digits, with the digits drawn randomly from a different subset of 500 MNIST digits\nthat does not overlap with the training set. Since the model has never seen any system with three\ndigits, such an \u201cout-of-domain\u201d run is a good test of the generalization ability of the model [21].\nThe average cross-entropy error of the 3-layer model on this dataset is 6379.42. By observing some\nof the prediction results, we \ufb01nd that the model can separate the overlapping digits successfully\nand predict the overall motion although the predicted digits are quite blurred. One \u201cout-of-domain\u201d\nprediction example is shown in Fig. 
4.\n\n4.2 Radar Echo Dataset\n\nZ\u2212min{Z}\n\nmax{Z}\u2212min{Z}\n\nThe radar echo dataset used in this paper is a subset of the three-year weather radar intensities\ncollected in Hong Kong from 2011 to 2013. Since not every day is rainy and our nowcasting target\nis precipitation, we select the top 97 rainy days to form our dataset. For preprocessing, we \ufb01rst\ntransform the intensity values Z to gray-level pixels P by setting P =\nand crop\nthe radar maps in the central 330 \u00d7 330 region. After that, we apply the disk \ufb01lter5 with radius 10\nand resize the radar maps to 100 \u00d7 100. To reduce the noise caused by measuring instruments, we\nfurther remove the pixel values of some noisy regions which are determined by applying K-means\nclustering to the monthly pixel average. The weather radar data is recorded every 6 minutes, so there\nare 240 frames per day. To get disjoint subsets for training, testing and validation, we partition each\ndaily sequence into 40 non-overlapping frame blocks and randomly assign 4 blocks for training, 1\nblock for testing and 1 block for validation. The data instances are sliced from these blocks using\na 20-frame-wide sliding window. Thus our radar echo dataset contains 8148 training sequences,\n2037 testing sequences and 2037 validation sequences and all the sequences are 20 frames long (5\nfor the input and 15 for the prediction). Although the training and testing instances sliced from the\nsame day may have some dependencies, this splitting strategy is still reasonable because in real-life\nnowcasting, we do have access to all previous data, including data from the same day, which allows\nus to apply online \ufb01ne-tuning of the model. Such data splitting may be viewed as an approximation\nof the real-life \u201c\ufb01ne-tuning-enabled\u201d setting for this application.\nWe set the patch size to 2 and train a 2-layer ConvLSTM network with each layer containing 64\nhidden states and 3 \u00d7 3 kernels. 
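The preprocessing pipeline described above (min-max normalization of the intensities, central 330 x 330 crop, resizing to 100 x 100) might look roughly like this sketch; it omits the disk filter and the K-means noise masking, and the nearest-neighbour resize is our simplification:

```python
import numpy as np

def preprocess_radar(Z, crop=330, out=100):
    """Sketch of the radar preprocessing: normalise echo intensities to
    gray levels via P = (Z - min Z) / (max Z - min Z), centre-crop the
    map, and crudely resize with nearest-neighbour index sampling."""
    P = (Z - Z.min()) / (Z.max() - Z.min())
    h, w = P.shape
    r0, c0 = (h - crop) // 2, (w - crop) // 2
    P = P[r0:r0 + crop, c0:c0 + crop]              # central crop x crop region
    idx = (np.arange(out) * crop / out).astype(int)
    return P[np.ix_(idx, idx)]                     # resize to out x out

Z = np.random.default_rng(0).uniform(0, 70, size=(480, 480))
print(preprocess_radar(Z).shape)  # (100, 100)
```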
For the ROVER algorithm, we tune the parameters of the optical \ufb02ow estimator6 on the validation\nset and use the best parameters (shown in the appendix) to report the test results. Also, we try three\ndifferent initialization schemes for ROVER: ROVER1 computes the optical \ufb02ow of the last two\nobserved frames and performs semi-Lagrangian advection afterwards; ROVER2 initializes the\nvelocity by the mean of the last two \ufb02ow \ufb01elds; and ROVER3 gives the initialization by a weighted\naverage (with weights 0.7, 0.2 and 0.1) of the last three \ufb02ow \ufb01elds. In addition, we train an FC-LSTM\nnetwork with two 2000-node LSTM layers. Both the ConvLSTM network and the FC-LSTM network\noptimize the cross-entropy error of 15 predictions.\nWe evaluate these methods using several commonly used precipitation nowcasting metrics, namely,\nrainfall mean squared error (Rainfall-MSE), critical success index (CSI), false alarm rate (FAR),\nprobability of detection (POD), and correlation. The Rainfall-MSE metric is de\ufb01ned as the average\nsquared error between the predicted rainfall and the ground truth. Since our predictions are done at\nthe pixel level, we project them back to radar echo intensities and calculate the rainfall at every cell of\nthe grid using the Z-R relationship [15]: Z = 10 log a + 10b log R, where Z is the radar echo intensity\nin dB, R is the rainfall rate in mm/h, and a, b are two constants with a = 118.239, b = 1.5241.\nThe CSI, FAR and POD are skill scores similar to the precision and recall commonly used by machine\nlearning researchers. We convert the prediction and ground truth to a 0/1 matrix using a threshold\nof 0.5 mm/h rainfall rate (indicating raining or not) and calculate the hits (prediction = 1, truth = 1),\nmisses (prediction = 0, truth = 1) and false alarms (prediction = 1, truth = 0). The three skill scores\nare de\ufb01ned as CSI = hits / (hits + misses + falsealarms), FAR = falsealarms / (hits + falsealarms),\nand POD = hits / (hits + misses). The correlation of a predicted frame P and a ground-truth frame T\nis de\ufb01ned as \u2211i,j Pi,jTi,j / (\u221a((\u2211i,j P\u00b2i,j)(\u2211i,j T\u00b2i,j)) + \u03b5), where \u03b5 = 10\u22129.\n\n5The disk \ufb01lter is applied using the MATLAB function fspecial(\u2019disk\u2019, 10).\n6We use an open-source project to calculate the optical \ufb02ow: http://sourceforge.net/projects/varflow/\n\n7\n\n\fTable 2: Comparison of the average scores of different models over 15 prediction steps.\n\nModel                         Rainfall-MSE  CSI    FAR    POD    Correlation\nConvLSTM(3x3)-3x3-64-3x3-64   1.420         0.577  0.195  0.660  0.908\nRover1                        1.712         0.516  0.308  0.636  0.843\nRover2                        1.684         0.522  0.301  0.642  0.850\nRover3                        1.685         0.522  0.301  0.642  0.849\nFC-LSTM-2000-2000             1.865         0.286  0.335  0.351  0.774\n\nFigure 5: Comparison of different models based on four precipitation nowcasting metrics over time.\n\nFigure 6: Two prediction examples for the precipitation nowcasting problem. All the predictions\nand ground truths are sampled with an interval of 3. From top to bottom: input frames; ground truth\nframes; prediction by ConvLSTM network; prediction by ROVER2.\n\nAll results are shown in Table 2 and Fig. 5. We can \ufb01nd that the performance of the FC-LSTM\nnetwork is not so good for this task, which is mainly caused by the strong spatial correlation in the\nradar maps, i.e., the motion of clouds is highly consistent in a local region. The fully-connected\nstructure has too many redundant connections and makes the optimization very unlikely to capture\nthese local consistencies. Also, it can be seen that ConvLSTM outperforms the optical \ufb02ow based\nROVER algorithm, which is mainly due to two reasons. First, ConvLSTM is able to handle the\nboundary conditions well. 
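For concreteness, the Z-R conversion and the evaluation metrics defined above can be sketched as follows (function names are ours; the skill scores assume at least one hit or false alarm):

```python
import numpy as np

def zr_rainfall(Z_dB, a=118.239, b=1.5241):
    """Invert the Z-R relationship Z = 10 log10(a) + 10 b log10(R)
    to get the rainfall rate R in mm/h from echo intensity Z in dB."""
    return 10.0 ** ((Z_dB - 10.0 * np.log10(a)) / (10.0 * b))

def skill_scores(pred, truth, threshold=0.5):
    """CSI, FAR, POD from binarised rain/no-rain maps (0.5 mm/h cut-off)."""
    p, t = pred >= threshold, truth >= threshold
    hits = np.sum(p & t)
    misses = np.sum(~p & t)
    false_alarms = np.sum(p & ~t)
    csi = hits / (hits + misses + false_alarms)
    far = false_alarms / (hits + false_alarms)
    pod = hits / (hits + misses)
    return csi, far, pod

def correlation(P, T, eps=1e-9):
    """sum(P*T) / (sqrt(sum(P^2) * sum(T^2)) + eps), as defined above."""
    return np.sum(P * T) / (np.sqrt(np.sum(P * P) * np.sum(T * T)) + eps)
```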
In real-life nowcasting, there are many cases when a sudden agglom-\neration of clouds appears at the boundary, which indicates that some clouds are coming from the\noutside. If the ConvLSTM network has seen similar patterns during training, it can discover this\ntype of sudden changes in the encoding network and give reasonable predictions in the forecasting\nnetwork. This, however, can hardly be achieved by optical \ufb02ow and semi-Lagrangian advection\nbased methods. Another reason is that, ConvLSTM is trained end-to-end for this task and some\ncomplex spatiotemporal patterns in the dataset can be learned by the nonlinear and convolutional\nstructure of the network. For the optical \ufb02ow based approach, it is hard to \ufb01nd a reasonable way to\nupdate the future \ufb02ow \ufb01elds and train everything end-to-end. Some prediction results of ROVER2\nand ConvLSTM are shown in Fig. 6. We can \ufb01nd that ConvLSTM can predict the future rainfall\ncontour more accurately especially in the boundary. Although ROVER2 can give sharper predic-\ntions than ConvLSTM, it triggers more false alarms and is less precise than ConvLSTM in general.\nAlso, the blurring effect of ConvLSTM may be caused by the inherent uncertainties of the task, i.e,\nit is almost impossible to give sharp and accurate predictions of the whole radar maps in longer-term\npredictions. We can only blur the predictions to alleviate the error caused by this type of uncertainty.\n\n5 Conclusion and Future Work\n\nIn this paper, we have successfully applied the machine learning approach, especially deep learning,\nto the challenging precipitation nowcasting problem which so far has not bene\ufb01ted from sophisti-\ncated machine learning techniques. We formulate precipitation nowcasting as a spatiotemporal se-\nquence forecasting problem and propose a new extension of LSTM called ConvLSTM to tackle the\nproblem. 
The ConvLSTM layer not only preserves the advantages of FC-LSTM but is also suitable for spatiotemporal data due to its inherent convolutional structure. By incorporating ConvLSTM into the encoding-forecasting structure, we build an end-to-end trainable model for precipitation nowcasting. For future work, we will investigate how to apply ConvLSTM to video-based action recognition. One idea is to add ConvLSTM on top of the spatial feature maps generated by a convolutional neural network and use the hidden states of ConvLSTM for the final classification.

References

[1] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio. Theano: New features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
[2] Y. Bengio, I. Goodfellow, and A. Courville. Deep Learning. Book in preparation for MIT Press, 2015.
[3] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: A CPU and GPU math expression compiler. In SciPy, volume 4, page 3, Austin, TX, 2010.
[4] R. Bridson. Fluid Simulation for Computer Graphics. Ak Peters Series. Taylor & Francis, 2008.
[5] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. In ECCV, pages 25-36, 2004.
[6] P. Cheung and H. Y. Yeung. Application of optical-flow technique to significant convection nowcast for terminal areas in Hong Kong. In the 3rd WMO International Symposium on Nowcasting and Very Short-Range Forecasting (WSN12), pages 6-10, 2012.
[7] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio.
Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, pages 1724-1734, 2014.
[8] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[9] R. H. Douglas. The stormy weather group (Canada). In Radar in Meteorology, pages 61-68, 1990.
[10] U. Germann and I. Zawadzki. Scale-dependence of the predictability of precipitation from continental radar images. Part I: Description of the methodology. Monthly Weather Review, 130(12):2859-2873, 2002.
[11] A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[13] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
[14] B. Klein, L. Wolf, and Y. Afek. A dynamic convolutional layer for short range weather prediction. In CVPR, 2015.
[15] P. W. Li, W. K. Wong, K. Y. Chan, and E. S. T. Lai. SWIRLS - An Evolving Nowcasting System. Hong Kong Special Administrative Region Government, 2000.
[16] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[17] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML, pages 1310-1318, 2013.
[18] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: A baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
[19] M. Reyniers. Quantitative Precipitation Forecasts Based on Radar Observations: Principles, Algorithms and Operational Systems. Institut Royal Météorologique de Belgique, 2008.
[20] H. Sakaino.
Spatio-temporal image pattern prediction method based on a physical model with time-varying optical flow. IEEE Transactions on Geoscience and Remote Sensing, 51(5-2):3023-3036, 2013.
[21] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
[22] J. Sun, M. Xue, J. W. Wilson, I. Zawadzki, S. P. Ballard, J. Onvlee-Hooimeyer, P. Joe, D. M. Barker, P. W. Li, B. Golding, M. Xu, and J. Pinto. Use of NWP for nowcasting convective precipitation: Recent progress and challenges. Bulletin of the American Meteorological Society, 95(3):409-426, 2014.
[23] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104-3112, 2014.
[24] T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. Coursera Course: Neural Networks for Machine Learning, 4, 2012.
[25] W. C. Woo and W. K. Wong. Application of optical flow techniques to rainfall nowcasting. In the 27th Conference on Severe Local Storms, 2014.
[26] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.