{"title": "Bidirectional Recurrent Neural Networks as Generative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 856, "page_last": 864, "abstract": "Bidirectional recurrent neural networks (RNN) are trained to predict both in the positive and negative time directions simultaneously. They have not been used commonly in unsupervised tasks, because a probabilistic interpretation of the model has been difficult. Recently, two different frameworks, GSN and NADE, have provided a connection between reconstruction and probabilistic modeling, which makes the interpretation possible. As far as we know, neither GSN nor NADE has been studied in the context of time series before. As an example of an unsupervised task, we study the problem of filling in gaps in high-dimensional time series with complex dynamics. Although unidirectional RNNs have recently been trained successfully to model such time series, inference in the negative time direction is non-trivial. We propose two probabilistic interpretations of bidirectional RNNs that can be used to reconstruct missing gaps efficiently. Our experiments on text data show that both proposed methods are much more accurate than unidirectional reconstructions, although somewhat less accurate than a computationally complex bidirectional Bayesian inference on the unidirectional RNN. 
We also provide results on music data for which the Bayesian inference is computationally infeasible, demonstrating the scalability of the proposed methods.", "full_text": "Bidirectional Recurrent Neural Networks as Generative Models\n\nMathias Berglund\nAalto University, Finland\n\nTapani Raiko\nAalto University, Finland\n\nMikko Honkala\nNokia Labs, Finland\n\nLeo Kärkkäinen\nNokia Labs, Finland\n\nAkos Vetek\nNokia Labs, Finland\n\nJuha Karhunen\nAalto University, Finland\n\nAbstract\n\nBidirectional recurrent neural networks (RNN) are trained to predict both in the positive and negative time directions simultaneously. They have not been used commonly in unsupervised tasks, because a probabilistic interpretation of the model has been difficult. Recently, two different frameworks, GSN and NADE, have provided a connection between reconstruction and probabilistic modeling, which makes the interpretation possible. As far as we know, neither GSN nor NADE has been studied in the context of time series before. As an example of an unsupervised task, we study the problem of filling in gaps in high-dimensional time series with complex dynamics. Although unidirectional RNNs have recently been trained successfully to model such time series, inference in the negative time direction is non-trivial. We propose two probabilistic interpretations of bidirectional RNNs that can be used to reconstruct missing gaps efficiently. Our experiments on text data show that both proposed methods are much more accurate than unidirectional reconstructions, although somewhat less accurate than a computationally complex bidirectional Bayesian inference on the unidirectional RNN. 
We also provide results on music data for which the Bayesian inference is computationally infeasible, demonstrating the scalability of the proposed methods.\n\n1 Introduction\n\nRecurrent neural networks (RNN) have recently been trained successfully for time series modeling, and have been used to achieve state-of-the-art results in supervised tasks including handwriting recognition [12] and speech recognition [13]. RNNs have also been used successfully in unsupervised learning of time series [26, 8].\nRecently, RNNs have also been used to generate sequential data [1] in a machine translation context, which further emphasizes the unsupervised setting. Bahdanau et al. [1] used a bidirectional RNN to encode a phrase into a vector, but settled for a unidirectional RNN to decode it into a translated phrase, perhaps because bidirectional RNNs have not been studied much as generative models. Even more recently, Maas et al. [18] used a deep bidirectional RNN in speech recognition, generating text as output.\n\nFigure 1: Structure of the simple RNN (left) and the bidirectional RNN (right).\n\nMissing value reconstruction is interesting in at least three different senses. Firstly, it can be used to cope with data that really has missing values. Secondly, reconstruction performance of artificially missing values can be used as a measure of performance in unsupervised learning [21]. Thirdly, reconstruction of artificially missing values can be used as a training criterion [9, 11, 27]. While traditional RNN training criteria correspond to one-step prediction, training to reconstruct longer gaps can push the model towards concentrating on longer-term predictions. Note that the one-step prediction criterion is typically used even in approaches that otherwise concentrate on modelling long-term dependencies [see e.g. 
19, 17].\nWhen using unidirectional RNNs as generative models, it is straightforward to draw samples from the model in sequential order. However, inference is not trivial in smoothing tasks, where we want to evaluate probabilities for missing values in the middle of a time series. For discrete data, inference with gap sizes of one is feasible; however, inference with larger gap sizes becomes exponentially more expensive. Even sampling can be exponentially expensive with respect to the gap size.\nOne strategy used for training models that are used for filling in gaps is to explicitly train the model with missing data [see e.g. 9]. However, such a criterion has not to our knowledge yet been used and thoroughly evaluated compared with other inference strategies for RNNs.\nIn this paper, we compare different methods of using RNNs to infer missing values for binary time series data. We evaluate the performance of two generative models that rely on bidirectional RNNs, and compare them to inference using a unidirectional RNN. The proposed methods are very favourable in terms of scalability.\n\n2 Recurrent Neural Networks\n\nRecurrent neural networks [24, 14] can be seen as extensions of the standard feedforward multilayer perceptron networks, where the inputs and outputs are sequences instead of individual observations. Let us denote the input to a recurrent neural network by X = {x_t} where x_t ∈ R^N is an input vector for each time step t. Let us further denote the output as Y = {y_t} where y_t ∈ R^M is an output vector for each time step t. Our goal is to model the distribution P(Y | X). Although RNNs map input sequences to output sequences, we can use them in an unsupervised manner by letting the RNN predict the next input. 
We can do so by setting Y = {y_t = x_{t+1}}.\n\n2.1 Unidirectional Recurrent Neural Networks\n\nThe structure of a basic RNN with one hidden layer is illustrated in Figure 1, where the output y_t is determined by\n\nP(y_t | {x_d}_{d=1}^{t}) = φ(W_y h_t + b_y)    (1)\n\nwhere\n\nh_t = tanh(W_h h_{t-1} + W_x x_t + b_h)    (2)\n\nand W_y, W_h, and W_x are the weight matrices connecting the hidden to output layer, hidden to hidden layer, and input to hidden layer, respectively. b_y and b_h are the output and hidden layer bias vectors, respectively. Typical options for the final nonlinearity φ are the softmax function for classification or categorical prediction tasks, or independent Bernoulli variables with sigmoid functions for other binary prediction tasks. In this form, the RNN therefore evaluates the output y_t based on information propagated through the hidden layer that directly or indirectly depends on the observations {x_d}_{d=1}^{t} = {x_1, . . . , x_t}.\n\n2.2 Bidirectional Recurrent Neural Networks\n\nBidirectional RNNs (BRNN) [25, 2] extend the unidirectional RNN by introducing a second hidden layer, where the hidden to hidden connections flow in opposite temporal order. The model is therefore able to exploit information both from the past and the future.\nThe output y_t is traditionally determined by\n\nP(y_t | {x_d}_{d≠t}) = φ(W_y^f h_t^f + W_y^b h_t^b + b_y),\n\nbut we propose the use of\n\nP(y_t | {x_d}_{d≠t}) = φ(W_y^f h_{t-1}^f + W_y^b h_{t+1}^b + b_y)    (3)\n\nwhere\n\nh_t^f = tanh(W_h^f h_{t-1}^f + W_x^f x_t + b_h^f)    (4)\nh_t^b = tanh(W_h^b h_{t+1}^b + W_x^b x_t + b_h^b).    (5)\n\nThe structure of the BRNN is illustrated in Figure 1 (right). Compared with the regular RNN, the forward and backward directions have separate non-tied weights and hidden activations, and are denoted by the superscripts f and b for forward and backward, respectively. Note that the connections are acyclic. Note also that in the proposed formulation, y_t does not get information from x_t. 
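As a toy illustration of these update rules, the proposed formulation can be sketched in pure Python for scalar inputs and hidden states (a minimal sketch with assumed weight names following the equations above, not the paper's implementation; a sigmoid output stands in for φ in the binary case):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def brnn_predict(x, Wf_h, Wf_x, bf_h, Wb_h, Wb_x, bb_h, Wf_y, Wb_y, b_y):
    """Toy scalar BRNN: P(y_t = 1 | all x except x_t), per the proposed formulation."""
    T = len(x)
    # Forward hidden states: h_f[t + 1] summarizes x[0..t]; h_f[0] is the zero initial state.
    h_f = [0.0] * (T + 1)
    for t in range(T):
        h_f[t + 1] = math.tanh(Wf_h * h_f[t] + Wf_x * x[t] + bf_h)
    # Backward hidden states: h_b[t] summarizes x[t..T-1]; h_b[T] is the zero initial state.
    h_b = [0.0] * (T + 1)
    for t in reversed(range(T)):
        h_b[t] = math.tanh(Wb_h * h_b[t + 1] + Wb_x * x[t] + bb_h)
    # The output at t combines h_f[t] (past only, up to x[t-1]) and
    # h_b[t + 1] (future only, from x[t+1]), so y_t never sees x_t itself.
    return [sigmoid(Wf_y * h_f[t] + Wb_y * h_b[t + 1] + b_y) for t in range(T)]
```

Flipping x_t leaves the prediction at position t unchanged while changing the predictions at neighbouring positions, which is exactly the property the text relies on for unsupervised use.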
We can therefore use the model in an unsupervised manner to predict one time step given all other time steps in the input sequence simply by setting Y = X.\n\n3 Probabilistic Interpretation for Unsupervised Modelling\n\nProbabilistic unsupervised modeling for sequences using a unidirectional RNN is straightforward, as the joint distribution for the whole sequence is simply the product of the individual predictions:\n\nP_unidirectional(X) = ∏_{t=1}^{T} P(x_t | {x_d}_{d=1}^{t-1}).    (6)\n\nFor the BRNN, the situation is more complicated. The network gives predictions for individual outputs given all the others, and the joint distribution cannot be written as their product. We propose two solutions for this, denoted by GSN and NADE.\nGSN Generative Stochastic Networks (GSN) [6] use a denoising auto-encoder to estimate the data distribution as the asymptotic distribution of the Markov chain that alternates between corruption and denoising. The resulting distribution is thus defined only implicitly, and cannot be written analytically. We can define a corruption function that masks x_t as missing, and a denoising function that reconstructs it from the others. It turns out that one feedforward pass of the BRNN does exactly that.\nOur first probabilistic interpretation is thus that the joint distribution defined by a BRNN is the asymptotic distribution of a process that replaces one observation vector x_t at a time in X by sampling it from P_BRNN(x_t | {x_d}_{d≠t}). In practice, we will start from a random initialization and use Gibbs sampling.\nNADE The Neural Autoregressive Distribution Estimator (NADE) [27] defines a probabilistic model by reconstructing missing components of a vector one at a time in a random order, starting from a fully unobserved vector. 
Each reconstruction is given by an auto-encoder network that takes as input the observations so far and an auxiliary mask vector that indicates which values are missing.\nWe extend the same idea for time series. Firstly, we concatenate an auxiliary binary element to input vectors to indicate a missing input. The joint distribution of the time series is defined by first drawing a random permutation o_d of time indices 1 . . . T and then setting data points observed one by one in that order, starting from a fully missing sequence:\n\nP_NADE(X | o_d) = ∏_{d=1}^{T} P(x_{o_d} | {x_{o_e}}_{e=1}^{d-1}).    (7)\n\nIn practice, the BRNN will be trained with some inputs marked as missing, while all the outputs are observed. See Section 5.1 for more training details.\n\n4 Filling in gaps with Recurrent Neural Networks\n\nThe task we aim to solve is to fill in gaps of multiple consecutive data points in high-dimensional binary time series data. The inference is not trivial for two reasons: firstly, we reconstruct multiple consecutive data points, which are likely to depend on each other, and secondly, we fill in data in the middle of a time series and hence need to consider the data both before and after the gap.\nFor filling in gaps with the GSN approach, we first train a bidirectional RNN to estimate P_BRNN(x_t | {x_d}_{d≠t}). In order to achieve that, we use the structure presented in Section 2.2. At test time, the gap is first initialized to random values, after which the missing values are sampled from the distribution P_BRNN(x_t | {x_d}_{d≠t}) one by one in a random order repeatedly to approximate the stationary distribution. 
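This test-time procedure can be sketched as follows (a minimal pure-Python sketch; `predict_proba` and `toy_proba` are hypothetical stand-ins for the trained BRNN's conditional distribution, which would be one feedforward pass of the network):

```python
import random

def fill_gap_gsn(x, gap, predict_proba, n_steps=100, seed=0):
    """Fill the missing positions `gap` of a binary sequence by Gibbs sampling.

    `predict_proba(x, t)` stands in for P_BRNN(x_t = 1 | {x_d, d != t});
    each MCMC step resamples one gap position, so each position is
    resampled about n_steps / len(gap) times on average.
    """
    rng = random.Random(seed)
    x = list(x)
    for t in gap:                    # random initialization of the gap
        x[t] = rng.randint(0, 1)
    for _ in range(n_steps):         # approximate the stationary distribution
        t = rng.choice(gap)
        x[t] = 1 if rng.random() < predict_proba(x, t) else 0
    return x

# Toy stand-in conditional model (an assumption for illustration, not a
# trained BRNN): P(x_t = 1) follows the mean of the neighbouring values.
def toy_proba(x, t):
    neigh = [x[i] for i in (t - 1, t + 1) if 0 <= i < len(x)]
    return 0.05 + 0.9 * (sum(neigh) / len(neigh))
```

Storing the probabilities of the finally sampled values then gives the reconstruction likelihoods evaluated in Section 5.3.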
For the RNN structures used in this paper, the computational complexity of this approach at test time is O((dc + c^2)(T + gM)) where d is the dimensionality of a data point, c is the number of hidden units in the RNN, T is the number of time steps in the data, g is the length of the gap and M is the number of Markov chain Monte Carlo (MCMC) steps used for inference.\nFor filling in gaps with the NADE approach, we first train a bidirectional RNN where some of the inputs are set to a separate missing value token. At test time, all data points in the gap are first initialized with this token, after which each missing data point is reconstructed once until the whole gap is filled. Computationally, the main difference to GSN is that we do not have to sample each reconstructed data point multiple times, but the reconstruction is done in as many steps as there are missing data points in the gap. For the RNN structures used in this paper, the computational complexity of this approach at test time is O((dc + c^2)(T + g)) where d is the dimensionality of a data point, c is the number of hidden units in the RNN, g is the length of the gap and T is the number of time steps in the data.\nIn addition to the two proposed methods, one can use a unidirectional RNN to solve the same task. We call this method Bayesian MCMC. Using a unidirectional RNN for the task of filling in gaps is not trivial, as we need to take into account the probabilities of the values after the gap, which the model does not explicitly do. We therefore resort to a similar approach as the GSN approach, where we replace P_BRNN(x_t | {x_d}_{d≠t}) with a unidirectional equivalent for the Gibbs sampling. 
As the unidirectional RNN models conditional probabilities of the form P_RNN(x_t | {x_d}_{d=1}^{t-1}), we can use Bayes' theorem to derive:\n\nP_RNN(x_t = a | {x_d}_{d≠t})    (8)\n∝ P_RNN(x_t = a | {x_d}_{d=1}^{t-1}) P_RNN({x_e}_{e=t+1}^{T} | x_t = a, {x_d}_{d=1}^{t-1})    (9)\n= ∏_{τ=t}^{T} P_RNN(x_τ | {x_d}_{d=1}^{τ-1}) |_{x_t = a}    (10)\n\nwhere P_RNN(x_τ | {x_d}_{d=1}^{τ-1}) is directly the output of the unidirectional RNN given an input sequence X, where one time step t, i.e. the one we Gibbs sample, is replaced by a proposal a. The problem is that we have to go through all possible proposals a separately to evaluate the probability P(x_t = a | {x_d}_{d≠t}). We therefore have to evaluate the product of the outputs of the unidirectional RNN for time steps t . . . T for each possible a.\nIn some cases this is feasible to evaluate. For categorical data, e.g. text, there are as many possible values for a as there are dimensions¹. However, for other binary data the number of possibilities grows exponentially, and is clearly not feasible to evaluate. For the RNN structures used in this paper, the computational complexity of this approach at test time is O((dc + c^2)(T + aTM)) where a is the number of different values a data point can have, d is the dimensionality of a data point, c is the number of hidden units in the RNN, T is the number of time steps in the data, and M is the number of MCMC steps used for inference. The critical difference in complexity to the GSN approach is the coefficient a, which for categorical data takes the value d, for binary vectors 2^d, and for continuous data is infinite.\nAs a simple baseline model, we also evaluate the one-gram log-likelihood of the gaps. 
The one-gram model assumes a constant context-independent categorical distribution for the categorical task, or a vector of factorial binomial probabilities for the structured prediction task:\n\nP_onegram(y_t) = f(b_y).\n\nThis can be done in O(dg).\n\n¹ For character-based text, the number of dimensions is the number of characters in the model alphabet.\n\nWe also compare to one-way inference, where the data points in the gap are reconstructed in order without taking the future context into account, using Equations (1) and (2) directly. The computational complexity is O((dc + c^2)T).\n\n5 Experiments\n\nWe run two sets of experiments: one for a categorical prediction task, and one for a binary structured prediction task. In the categorical prediction task we fill in gaps of five characters in Wikipedia text, while in the structural prediction task we fill in gaps of five time steps in different polyphonic music data sets.\n\n5.1 Training details for categorical prediction task\n\nFor the categorical prediction task, we test the performance of the two proposed methods, GSN and NADE. In addition, we compare the performance to MCMC using Bayesian inference and one-way inference with a unidirectional RNN. We therefore have to train three different RNNs, one for each method.\nEach RNN is trained as a predictor network, where the character at each step is predicted based on all the previous characters (in the case of the RNN) or all the previous and following characters (in the case of the BRNNs). We use the same data set as Sutskever et al. [26], which consists of 2GB of English text from Wikipedia. For training, we follow a similar strategy as Hermans and Schrauwen [15]. The characters are encoded as one-hot binary vectors with a dimensionality of d = 96 characters and the output is modelled with a softmax distribution. 
We train the unidirectional RNN with string lengths of T = 250 characters, where the error is propagated only from the last 200 outputs. In the BRNN we use a string length of T = 300 characters, where the error is propagated from the middle 200 outputs. We therefore avoid propagating the gradient from predictions that lack long temporal context.\nFor the BRNN used in the NADE method, we add one dimension to the one-hot input which corresponds to a missing value token. During training, in each minibatch we mark g = 5 consecutive characters every 25 time steps as a gap. During training, the error is propagated only from these gaps. For each gap, we uniformly draw a value from 1 to 5, and set that many characters in the gap to the missing value token. The model is therefore trained to predict the output in different stages of inference, where a number of the inputs are still marked as missing. For comparison, we also train a similar network, but without masking. In that variant, the error is therefore propagated from all time steps. We refer to these two training methods as "NADE masked" and "NADE no mask", respectively.\nFor all the models, the weight elements are drawn from the uniform distribution: w_{i,j} ∼ U[−s, s], where s = 1 for the input to hidden layer and, following Glorot and Bengio [10], s = √(6 / (d_in + d_out)) for the hidden-to-hidden and the hidden-to-output layers. The biases are initialized to zero.\nWe use c = 1000 hidden units in the unidirectional RNN and c = 684 hidden units in the two hidden layers in the BRNNs. The number of parameters in the two model types is therefore roughly the same. In the recurrent layers, we set the recurrent activation connected to the first time step to zero.\nThe networks are trained using stochastic gradient descent and the gradient is calculated using backpropagation through time. We use a minibatch size of 40, i.e. 
each minibatch consists of 40 randomly sampled sequences of length 250. As the gradients tend to occasionally "blow up" when training RNNs [5, 20], we normalize the gradients at each update to have length one. The step size is set to 0.25 for all layers in the beginning of training, and it is linearly decayed to zero during training. As training the model is very time-consuming², we do not optimize the hyperparameters, or repeat runs to get confidence intervals around the evaluated performances.\n\n² We used about 8 weeks of GPU time for the reported results.\n\n5.2 Training Details for the Binary Structured Prediction Task\n\nIn the other set of experiments, we use four polyphonic music data sets [8]. The data sets consist of at least 7 hours of polyphonic music each, where each data point is a binary d = 88-dimensional vector that represents one time step of MIDI-encoded music, indicating which of the 88 keys of a piano are pressed. We test the performance of the two proposed methods, but omit training the unidirectional RNNs as the computational complexity of the Bayesian MCMC is prohibitive (a = 2^88).\nWe train all models for 50 000 updates in minibatches of approximately 3 000 individual data points³. As the data sets are small, we select the initial learning rate on a grid of {0.0001, 0.0003, . . . , 0.3, 1} based on the lowest validation set cost. We use no "burn-in" as several of the scores are fairly short, and therefore do not specifically mask out values in the beginning or end of the data set as we did for the text data.\nFor the NADE method, we use an additional dimension as a missing value token in the data. 
For the missing values, we set the missing value token to one and the other dimensions to zero.\nOther training details are similar to the categorical prediction task.\n\n5.3 Evaluation of Models\n\nAt test time, we evaluate the models by calculating the mean log-likelihood of the correct value of gaps of five consecutive missing values in test data.\nIn the GSN and Bayesian MCMC approaches, we first set the five values in the gap to a random value for the categorical prediction task, or to zero for the structured prediction task. We then sample all five values in the gap in random order, and repeat the procedure for M = 100 MCMC steps⁴. For evaluating the log-likelihood of the correct value for the string, we force the last five steps to sample the correct value, and store the probability of the model sampling those values. We also evaluate the probability of reconstructing correctly the individual data points by not forcing the last five time steps to sample the correct value, but by storing the probability of reconstructing the correct value for each data point separately. We run the MCMC chain 100 times and use the log of the mean of the likelihoods of predicting the correct value over these 100 runs.\nWhen evaluating the performance of one-directional inference, we use a similar approach to MCMC. However, when evaluating the log-likelihood of the entire gap, we only construct it once in sequential order, and record the probabilities of reconstructing the correct value. When evaluating the probability of reconstructing the correct value for each data point separately, we use the same approach as for MCMC and sample the gap 100 times, recording for each step the probability of sampling the correct value. 
The result for each data point is the log of the mean of the likelihoods over these 100 runs.\nOn the Wikipedia data, we evaluate the GSN and NADE methods on 50 000 gaps on the test data. On the music data, all models are evaluated on all possible gaps of g = 5 on the test data, excluding gaps that intersect with the first and last 10 time steps of a score. When evaluating the Bayesian MCMC with the unidirectional RNN, we have to significantly limit the size of the data set, as the method is highly computationally complex. We therefore run it on 1 000 gaps on the test data.\nFor NADE, we set the five time steps in the gap to the missing value token. We then reconstruct them one by one to the correct value, and record the probability of the correct reconstruction. We repeat this process for all possible permutations of the order in which to do the reconstruction, and therefore acquire the exact probability of the correct reconstruction given the model and the data. We also evaluate the individual character reconstruction probabilities by recording the probability of sampling the correct value given all other values in the gap are set to missing.\n\n³ A minibatch can therefore consist of e.g. 100 musical scores, each of length T = 30.\n⁴ M = 100 MCMC steps means that each value in the gap of g = 5 will be resampled M/g = 20 times.\n\n5.4 Results\n\nFrom Table 1 we can see that the Bayesian MCMC method seems to yield the best results, while both GSN and NADE outperform one-way inference. It is worth noting that in the most difficult data sets, piano and JSB, one-way inference performs very well. Qualitative examples of the reconstructions obtained with the GSN and NADE on the Wikipedia data are shown in Table 3 (supplementary material).\n\nTable 1: Negative Log Likelihood (NLL) for gaps of five time steps using different models (lower is better). In the experiments, GSN and NADE perform well, although they are outperformed by Bayesian MCMC.\n\nInference strategy   Wikipedia  Nottingham  Piano  Muse  JSB\nGSN                  4.60       19.1        38.8   37.3  43.8\nNADE masked          4.86       19.0        40.4   36.5  44.3\nNADE                 4.88       18.5        39.4   34.7  44.6\nBayesian MCMC        4.37       NA          NA     NA    NA\nOne-way inference    5.79       19.2        38.9   37.6  43.9\nOne-gram             23.3       145         138    147   118\n\nFigure 2: Average NLL per data point using different methods with the Wikipedia data set (left) and the Piano data set (right) for different positions in a gap of 5 consecutive missing values. The middle data point is the most difficult to estimate for most methods, while the one-way inference cannot take future context into account, making prediction of later positions difficult. For the left-most position in the gap, the one-way inference performs the best since it does not require any approximations such as MCMC.\n\nIn order to get an indication of how the number of MCMC steps in the GSN approach affects performance, we plotted the difference in NLL of GSN and NADE of the test set as a function of the number of MCMC steps in Figure 3 (supplementary material). The figure indicates that the music data sets mix fairly well, as the performance of GSN quickly saturates. 
However, for the Wikipedia data, the performance could probably be even further improved by letting the MCMC chain run for more than M = 100 steps.\nIn Figure 2 we have evaluated the NLL for the individual characters in the gaps of length five. As expected, all methods except for one-way inference are better at predicting characters close to both edges of the gap.\nAs a sanity check, we make sure our models have been successfully trained by evaluating the mean test log-likelihood of the BRNNs for gap sizes of one. In Table 2 (supplementary material) we can see that the BRNNs expectedly outperform previously published results with unidirectional RNNs, which indicates that the models have been trained successfully.\n\n6 Conclusion and Discussion\n\nAlthough recurrent neural networks have been used as generative models for time series data, it has not been trivial how to use them for inference in cases such as missing gaps in the sequential data.\nIn this paper, we proposed to use bidirectional RNNs as generative models for time series, with two probabilistic interpretations called GSN and NADE. Both provide efficient inference in both positive and negative directions in time, and both can be used in tasks where Bayesian inference of a unidirectional RNN is computationally infeasible.\nThe model we trained for NADE differed from the basic BRNN in several ways: Firstly, we artificially marked gaps of 5 consecutive points as missing, which should help in specializing the model for such reconstruction tasks. It would be interesting to study the effect of the missingness pattern used in training on the learned representations and predictions. Secondly, in addition to using all outputs as the training signal, we tested using only the reconstructions of those missing values as the training signal. This reduces the effective amount of training that the model went through. 
Thirdly, the model had one more input (the missingness indicator), which makes the learning task more difficult.\nWe can see from Table 2 that the model we trained for NADE, where we only used the reconstructions as the training signal, has a worse performance than the BRNN for reconstructing single values. This indicates that these differences in training have a significant impact on the quality of the final trained probabilistic model.\nWe used the same number of parameters when training an RNN and a BRNN. The RNN can concentrate all the learning effort on forward prediction, and re-use the learned dependencies in backward inference by the computationally heavy Bayesian inference. It remains an open question which approach would work best given an optimal size of the hidden layers.\nAs future work, other model structures could be explored in this context, for instance the Long Short-Term Memory [16]. Specifically to our NADE approach, it might make sense to replace the regular additive connection from the missingness indicator input to the hidden activations in Eq. (4, 5) by a multiplicative connection that somehow gates the dynamics mappings W_h^f and W_h^b. Another direction to extend is to use a deep architecture with more hidden layers.\nThe MIDI music data is an example of a structured prediction task: components of the output vector depend strongly on each other. However, our model assumes independent Bernoulli distributions for them. One way to take those dependencies into account is to use stochastic hidden units h_t^f and h_t^b, which has been shown to improve performance on structured prediction tasks [22]. Bayer and Osendorfer [4] explored that approach, and reconstructed missing values in the middle of motion capture data. 
In their reconstruction method, the hidden stochastic variables are selected based on an auxiliary inference model, after which the missing values are reconstructed conditioned on the hidden stochastic variable values. Both steps are done with maximum a posteriori point selection instead of sampling. Further quantitative evaluation of the method would be an interesting point of comparison.\nThe proposed methods could be easily extended to continuous-valued data. As an example application, time-series reconstruction with a recurrent model has been shown to be effective in speech recognition especially under impulsive noise [23].\n\nAcknowledgements\n\nWe thank KyungHyun Cho and Yoshua Bengio for useful discussions. The software for the simulations for this paper was based on Theano [3, 7]. Nokia has supported Mathias Berglund and the Academy of Finland has supported Tapani Raiko.\n\nReferences\n[1] Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR 2015).\n[2] Baldi, P., Brunak, S., Frasconi, P., Soda, G., and Pollastri, G. (1999). Exploiting the past and the future in protein secondary structure prediction. Bioinformatics, 15(11), 937–946.\n[3] Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.\n[4] Bayer, J. and Osendorfer, C. (2014). Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610.\n[5] Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.\n[6] Bengio, Y., Yao, L., Alain, G., and Vincent, P. (2013). 
Generalized denoising auto-encoders as generative models. In Advances in Neural Information Processing Systems, pages 899–907.

[7] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy 2010). Oral presentation.

[8] Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012). Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), pages 1159–1166.

[9] Brakel, P., Stroobandt, D., and Schrauwen, B. (2013). Training energy-based models for time-series imputation. The Journal of Machine Learning Research, 14(1), 2771–2797.

[10] Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249–256.

[11] Goodfellow, I., Mirza, M., Courville, A., and Bengio, Y. (2013). Multi-prediction deep Boltzmann machines. In Advances in Neural Information Processing Systems, pages 548–556.

[12] Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., and Schmidhuber, J. (2009). A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5), 855–868.

[13] Graves, A., Mohamed, A.-r., and Hinton, G. (2013). Speech recognition with deep recurrent neural networks. arXiv preprint arXiv:1303.5778.

[14] Haykin, S. (2009). Neural Networks and Learning Machines, volume 3. Pearson Education.

[15] Hermans, M. and Schrauwen, B. (2013). Training and analysing deep recurrent neural networks. In C. Burges, L. Bottou, M. Welling, Z.
Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 190–198. Curran Associates, Inc.

[16] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

[17] Koutník, J., Greff, K., Gomez, F., and Schmidhuber, J. (2014). A clockwork RNN. In Proceedings of the 31st International Conference on Machine Learning.

[18] Maas, A. L., Hannun, A. Y., Jurafsky, D., and Ng, A. Y. (2014). First-pass large vocabulary continuous speech recognition using bi-directional recurrent DNNs. arXiv preprint arXiv:1408.2873.

[19] Mikolov, T., Joulin, A., Chopra, S., Mathieu, M., and Ranzato, M. (2014). Learning longer memory in recurrent neural networks. arXiv preprint arXiv:1412.7753.

[20] Pascanu, R., Mikolov, T., and Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), pages 1310–1318.

[21] Raiko, T. and Valpola, H. (2001). Missing values in nonlinear factor analysis. In Proceedings of the 8th International Conference on Neural Information Processing (ICONIP 2001), Shanghai, pages 822–827.

[22] Raiko, T., Berglund, M., Alain, G., and Dinh, L. (2015). Techniques for learning binary stochastic feedforward neural networks. In International Conference on Learning Representations (ICLR 2015), San Diego.

[23] Remes, U., Palomäki, K., Raiko, T., Honkela, A., and Kurimo, M. (2011). Missing-feature reconstruction with a bounded nonlinear state-space model. IEEE Signal Processing Letters, 18(10), 563–566.

[24] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.

[25] Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks.
IEEE Transactions on Signal Processing, 45(11), 2673–2681.

[26] Sutskever, I., Martens, J., and Hinton, G. E. (2011). Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML 2011), pages 1017–1024.

[27] Uria, B., Murray, I., and Larochelle, H. (2014). A deep and tractable density estimator. In Proceedings of the 31st International Conference on Machine Learning, pages 467–475.