{"title": "Semi-supervised Sequence Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3079, "page_last": 3087, "abstract": "We present two approaches that use unlabeled data to improve sequence learning with recurrent networks. The first approach is to predict what comes next in a sequence, which is a language model in NLP. The second approach is to use a sequence autoencoder, which reads the input sequence into a vector and predicts the input sequence again. These two algorithms can be used as a \u201cpretraining\u201d step for a later supervised sequence learning algorithm. In other words, the parameters obtained from the pretraining step can then be used as a starting point for other supervised training models. In our experiments, we find that long short-term memory recurrent networks become more stable to train and generalize better after being pretrained with the two approaches. With pretraining, we were able to achieve strong performance on many classification tasks, such as text classification with IMDB and DBpedia, or image recognition with CIFAR-10.", "full_text": "Semi-supervised Sequence Learning\n\nAndrew M. Dai\nGoogle Inc.\nadai@google.com\n\nQuoc V. Le\nGoogle Inc.\nqvl@google.com\n\nAbstract\n\nWe present two approaches that use unlabeled data to improve sequence learning with recurrent networks. The first approach is to predict what comes next in a sequence, which is a language model in NLP. The second approach is to use a sequence autoencoder, which reads the input sequence into a vector and predicts the input sequence again. These two algorithms can be used as a \u201cpretraining\u201d step for a later supervised sequence learning algorithm. In other words, the parameters obtained from the pretraining step can then be used as a starting point for other supervised training models. 
In our experiments, we find that long short-term memory recurrent networks become more stable to train and generalize better after being pretrained with the two approaches. With pretraining, we were able to achieve strong performance on many classification tasks, such as text classification with IMDB and DBpedia, or image recognition with CIFAR-10.\n\n1 Introduction\n\nRecurrent neural networks (RNNs) are powerful tools for modeling sequential data, yet training them by back-propagation through time [37, 27] can be difficult [9]. For that reason, RNNs have rarely been used for natural language processing tasks such as text classification, despite their ability to preserve word ordering.\n\nOn a variety of document classification tasks, we find that it is possible to train an LSTM [10] RNN to achieve good performance with careful tuning of hyperparameters. We also find that a simple pretraining step can significantly stabilize the training of LSTMs. A simple pretraining method is to use a recurrent language model as the starting point of the supervised network. A slightly better method is to use a sequence autoencoder, which uses an RNN to read a long input sequence into a single vector. This vector is then used to reconstruct the original sequence. The weights obtained from pretraining can then be used as an initialization for the standard LSTM RNNs. 
We believe that this semi-supervised approach [1] is superior to other unsupervised sequence learning methods, e.g., Paragraph Vectors [19], because it allows for easy fine-tuning.\n\nIn our experiments on document classification with 20 Newsgroups [17] and DBpedia [20], and on sentiment analysis with IMDB [22] and Rotten Tomatoes [26], LSTMs pretrained by recurrent language models or sequence autoencoders are usually better than LSTMs initialized randomly.\n\nAnother important result from our experiments is that it is possible to use unlabeled data from related tasks to improve the generalization of a subsequent supervised model. For example, using unlabeled data from Amazon reviews to pretrain the sequence autoencoders can improve classification accuracy on Rotten Tomatoes from 79.0% to 83.3%, an effect equivalent to adding substantially more labeled data. This evidence supports the thesis that it is possible to use unsupervised learning with more unlabeled data to improve supervised learning. With sequence autoencoders and outside unlabeled data, LSTMs are able to match or surpass previously reported results.\n\nOur semi-supervised learning approach is related to Skip-Thought vectors [14], with two differences. The first difference is that Skip-Thought has a harder objective, because it predicts adjacent sentences. The second is that Skip-Thought is a purely unsupervised learning algorithm, without fine-tuning.\n\n2 Sequence autoencoders and recurrent language models\n\nOur approach to sequence autoencoding is inspired by the work on sequence-to-sequence learning (also known as seq2seq) by Sutskever et al. [32], which has been successfully used for machine translation [21, 11], text parsing [33], image captioning [35], video analysis [31], speech recognition [4] and conversational modeling [28, 34]. 
Key to their approach is the use of a recurrent network as an encoder that reads an input sequence into a hidden state, which then serves as the input to a decoder recurrent network that predicts the output sequence.\n\nThe sequence autoencoder is similar to the above concept, except that it is an unsupervised learning model. The objective is to reconstruct the input sequence itself. That means we replace the output sequence in the seq2seq framework with the input sequence. In our sequence autoencoders, the weights of the decoder network and the encoder network are the same (see Figure 1).\n\nFigure 1: The sequence autoencoder for the sequence \u201cWXYZ\u201d. The sequence autoencoder uses a recurrent network to read the input sequence into the hidden state, which can then be used to reconstruct the original sequence.\n\nWe find that the weights obtained from the sequence autoencoder can be used as an initialization for another supervised network, one which tries to classify the sequence. We hypothesize that this is because the network can already memorize the input sequence. This, together with the fact that the gradients have shortcuts, is our hypothesis for why the sequence autoencoder is a good and stable approach to initializing recurrent networks.\n\nA significant property of the sequence autoencoder is that it is unsupervised, and thus can be trained with large quantities of unlabeled data to improve its quality. Our result is that additional unlabeled data can improve the generalization ability of recurrent networks. This is especially useful for tasks that have limited labeled data.\n\nWe also find that recurrent language models [2, 24] can be used as a pretraining method for LSTMs. This is equivalent to removing the encoder part of the sequence autoencoder in Figure 1. 
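As a concrete illustration, the encoder-decoder reading and reconstruction described above can be sketched with a toy, untrained tanh RNN in place of the paper's LSTM. All names and sizes here are illustrative; the shared W_hh/W_xh weights mirror the tied encoder/decoder weights:

```python
import math
import random

def rnn_step(W_hh, W_xh, h, x):
    """One step of a plain tanh RNN: h' = tanh(W_hh @ h + W_xh @ x)."""
    return [math.tanh(sum(W_hh[i][j] * h[j] for j in range(len(h))) +
                      sum(W_xh[i][k] * x[k] for k in range(len(x))))
            for i in range(len(h))]

def autoencode(tokens, vocab, hidden=8, seed=0):
    """Encode a token sequence into a single hidden vector, then unroll the
    *same* recurrent weights as a decoder that tries to reproduce it."""
    rng = random.Random(seed)
    V = len(vocab)
    W_hh = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(hidden)]
    W_xh = [[rng.uniform(-0.1, 0.1) for _ in range(V)] for _ in range(hidden)]
    W_hy = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(V)]

    def one_hot(tok):
        v = [0.0] * V
        v[vocab.index(tok)] = 1.0
        return v

    # Encoder: fold the whole input into one hidden state.
    h = [0.0] * hidden
    for tok in tokens:
        h = rnn_step(W_hh, W_xh, h, x=one_hot(tok))

    # Decoder: reuse W_hh/W_xh (tied weights), emit one token per step.
    # Conventionally the decoder is seeded with a <go>/<eos> symbol; we
    # reuse the last input token here to keep the sketch minimal.
    outputs = []
    x = one_hot(tokens[-1])
    for _ in tokens:
        h = rnn_step(W_hh, W_xh, h, x)
        scores = [sum(W_hy[i][j] * h[j] for j in range(hidden)) for i in range(V)]
        pred = vocab[scores.index(max(scores))]
        outputs.append(pred)
        x = one_hot(pred)
    return h, outputs
```

A trained version would fit these weights to minimize reconstruction error; the point here is only the shape of the computation: the entire input is compressed into the single vector h before decoding begins.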
Our experimental results show that this approach works better than LSTMs with random initialization.\n\n3 Overview of baselines\n\nIn our experiments, we use LSTM recurrent networks [10] because they are generally better than standard RNNs. Our LSTM implementation is standard and has input gates, forget gates, and output gates [6, 7, 8]. We compare this basic LSTM against an LSTM initialized with the sequence autoencoder method. When the LSTM is initialized with a sequence autoencoder, the method is called SA-LSTM in our experiments. When the LSTM is initialized with a language model, the method is called LM-LSTM. We also compare our methods to other baselines, e.g., bag-of-words methods or paragraph vectors, previously reported on the same datasets.\n\nIn most of our experiments, the output layer predicts the document label from the LSTM output at the last timestep. We also experiment with putting the label at every timestep and linearly increasing the weights of the prediction objectives from 0 to 1 [25]. This way we can inject gradients into earlier steps of the recurrent networks. We call this approach linear label gain. Lastly, we also experiment with jointly training the supervised learning task with the sequence autoencoder, and call this method joint training.\n\n4 Experiments\n\nIn our experiments with LSTMs, we follow the basic recipes described in [7, 32] by clipping the cell outputs and gradients. The benchmarks of focus are text understanding tasks, with all datasets publicly available. The tasks are sentiment analysis (IMDB and Rotten Tomatoes) and text classification (20 Newsgroups and DBpedia). Commonly used methods on these datasets, such as bag-of-words or n-grams, typically ignore long-range ordering information (e.g., modifiers and their objects may be separated by many unrelated words), so one would expect recurrent methods, which preserve ordering information, to perform well. 
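As an aside, the linear label gain weighting described in Section 3 above can be sketched as follows; the exact schedule and normalization used in the paper may differ, so this is an illustrative assumption:

```python
def linear_label_gain(T):
    """Per-timestep weights for the classification loss, rising linearly
    from 0 at the first timestep to 1 at the last."""
    if T == 1:
        return [1.0]
    return [t / (T - 1) for t in range(T)]

def weighted_loss(step_losses):
    """Combine per-timestep classification losses using the linear gain,
    so that later timesteps (which have seen more input) count more."""
    w = linear_label_gain(len(step_losses))
    return sum(wi * li for wi, li in zip(w, step_losses))
```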
Nevertheless, due to the difficulty of optimizing these networks, recurrent models have not been the method of choice for document classification.\n\nIn our experiments with the sequence autoencoder, we train it to reproduce the full document after reading all the input words. In other words, we do not perform any truncation or windowing. We add an end-of-sentence marker to the end of each input sequence and train the network to start reproducing the sequence after that marker. To speed up training and reduce GPU memory usage, we perform truncated backpropagation up to 400 timesteps from the end of the sequence. We preprocess the text so that punctuation is treated as separate tokens, and we ignore any non-English characters and words in the DBpedia text. We also remove words that appear only once in each dataset, and do not perform any term weighting or stemming.\n\nAfter training the recurrent language model or the sequence autoencoder for roughly 500K steps with a batch size of 128, we use both the word embedding parameters and the LSTM weights to initialize the LSTM for the supervised task. We then train on that task while fine-tuning both the embedding parameters and the weights, and use early stopping when the validation error starts to increase. We choose the dropout parameters based on a validation set.\n\nUsing SA-LSTMs, we are able to match or surpass reported results for all datasets. It is important to emphasize that previous best results come from various different methods. So it is significant that one method achieves strong results for all datasets, presumably because such a method can be used as a general model for any similar task. A summary of results from the experiments is shown in Table 1. 
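The preprocessing and autoencoder target construction described above can be sketched as follows (the function names and the tokenizing regex are illustrative, not the paper's code):

```python
import re
from collections import Counter

def tokenize(text):
    # Treat punctuation as separate tokens -- a simplified stand-in for
    # the paper's preprocessing; no stemming or term weighting is applied.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(tokenized_docs, min_count=2):
    # Drop words that appear only once in the dataset.
    counts = Counter(t for doc in tokenized_docs for t in doc)
    return {t for t, c in counts.items() if c >= min_count}

def autoencoder_example(tokens, eos="<eos>"):
    # The encoder reads the document followed by an end marker; the decoder
    # is trained to reproduce the same document after that marker.
    return tokens + [eos], tokens + [eos]
```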
More details of the experiments are as follows.\n\nTable 1: A summary of the error rates of SA-LSTMs and previous best reported results.\n\nDataset  SA-LSTM  Previous best result\nIMDB  7.24%  7.42%\nRotten Tomatoes  16.7%  18.5%\n20 Newsgroups  15.6%  17.1%\nDBpedia  1.19%  1.74%\n\n4.1 Sentiment analysis experiments with IMDB\n\nIn this first set of experiments, we benchmark our methods on the IMDB movie sentiment dataset proposed by Maas et al. [22].1 There are 25,000 labeled and 50,000 unlabeled documents in the training set and 25,000 in the test set. We use 15% of the labeled training documents as a validation set. The average length of each document is 241 words and the maximum length of a document is 2,526 words. The previous baselines are bag-of-words, ConvNets [13] and Paragraph Vectors [19].\n\nSince the documents are long, one might expect that it is difficult for recurrent networks to learn. However, we find that with tuning, it is possible to train LSTM recurrent networks to fit the training set. For example, if we set the size of the hidden state to 512 units and truncate backpropagation to 400 steps, an LSTM can do fairly well. With random embedding dimension dropout [38] and random word dropout (not published previously), we are able to reach a performance of around 86.5% accuracy on the test set, which is approximately 5% worse than most baselines.\n\n1 http://ai.stanford.edu/~amaas/data/sentiment/index.html\n\nFundamentally, the main problem with this approach is that it is unstable: if we were to increase the number of hidden units or the number of backprop steps, training breaks down very quickly: the objective function explodes even with careful tuning of gradient clipping. This is because LSTMs are sensitive to the hyperparameters for long documents. In contrast, we find that the SA-LSTM works better and is more stable. 
If we use the sequence autoencoders, changing the size of the hidden state or the number of backprop steps hardly affects the training of LSTMs. This is important because the models become more practical to train.\n\nUsing sequence autoencoders, we overcome the optimization instability in LSTMs in such a way that it becomes fast and easy to achieve perfect classification on the training set. To avoid overfitting, we again use input dimension dropout, with the dropout rate chosen on a validation set. We find that dropping out 80% of the input embedding dimensions works well for this dataset. The results of our experiments are shown in Table 2 together with previous baselines. We also add a baseline where we initialize an LSTM with word2vec embeddings trained on the training set.\n\nTable 2: Performance of models on the IMDB sentiment classification task.\n\nModel  Test error rate\nLSTM with tuning and dropout  13.50%\nLSTM initialized with word2vec embeddings  10.00%\nLM-LSTM (see Section 2)  7.64%\nSA-LSTM (see Figure 1)  7.24%\nSA-LSTM with linear gain (see Section 3)  9.17%\nSA-LSTM with joint training (see Section 3)  14.70%\nFull+Unlabeled+BoW [22]  11.11%\nWRRBM + BoW (bnc) [22]  10.77%\nNBSVM-bi (Naïve Bayes SVM with bigrams) [36]  8.78%\nseq2-bown-CNN (ConvNet with dynamic pooling) [12]  7.67%\nParagraph Vectors [19]  7.42%\n\nThe results confirm that SA-LSTM with input embedding dropout can be as good as previous best results on this dataset. In contrast, LSTMs without sequence autoencoders have trouble optimizing the objective because of long-range dependencies in the documents.\n\nUsing language modeling (LM-LSTM) as an initialization works well, achieving 8.98%, but less well compared to the SA-LSTM. 
This is perhaps because language modeling is a short-term objective, so the hidden state captures only the ability to predict the next few words.\n\nIn the table above, we use 1,024 units for the memory cells and 512 units for the input embedding layer in the LM-LSTM and SA-LSTM. We also use a hidden layer of 30 units with dropout of 50% between the last hidden state and the classifier. We continue to use these settings in the following experiments.\n\nIn Table 3, we present some examples from the IMDB dataset that are correctly classified by SA-LSTM but not by a bigram NBSVM model. These examples often have long-term dependencies or sarcasm that is difficult to detect by looking only at short phrases.\n\n4.2 Sentiment analysis experiments with Rotten Tomatoes and the positive effects of additional unlabeled data\n\nThe success on the IMDB dataset convinces us to test our methods on another sentiment analysis task to see if similar gains can be obtained. The benchmark of focus in this experiment is the Rotten Tomatoes dataset [26].2 The dataset has 10,662 documents, which are randomly split into 80% for training, 10% for validation and 10% for test. The average length of each document is 22 words and the maximum length is 52 words. Thus, compared to IMDB, this dataset is smaller both in terms of the number of documents and the number of words per document.\n\n2 http://www.cs.cornell.edu/people/pabo/movie-review-data/\n\nTable 3: IMDB sentiment classification examples that are correctly classified by SA-LSTM and incorrectly by NBSVM-bi.\n\n[Negative] Looking for a REAL super bad movie? If you wanna have great fun, don't hesitate and check this one! Ferrigno is incredibly bad but is also the best of this mediocrity.\n\n[Negative] A professional production with quality actors that simply never touched the heart or the funny bone no matter how hard it tried. The quality cast, stark setting and excellent cinemetography made you hope for Fargo or High Plains Drifter but sorry, the soup had no seasoning...or meat for that matter. A 3 (of 10) for effort.\n\n[Negative] The screen-play is very bad, but there are some action sequences that i really liked. I think the image is good, better than other romanian movies. I liked also how the actors did their jobs.\n\nOur first observation is that it is easier to train LSTMs on this dataset than on the IMDB dataset, and the gaps between LSTMs, LM-LSTMs and SA-LSTMs are smaller than before. This is because movie reviews in Rotten Tomatoes are sentences whereas reviews in IMDB are paragraphs.\n\nAs this dataset is small, our methods tend to severely overfit the training set. Combining SA-LSTMs with 95% input embedding dropout and 50% word dropout improves generalization; tuning the SA-LSTM further on the validation set improves the result to a 19.3% error rate on the test set.\n\nTo improve performance, we add unlabeled data from the IMDB dataset of the previous experiment and Amazon movie reviews [23] to the autoencoder training stage.3 We also run a control experiment where we use pretrained word vectors trained by word2vec on Google News.\n\nTable 4: Performance of models on the Rotten Tomatoes sentiment classification task.\n\nModel  Test error rate\nLSTM with tuning and dropout  20.3%\nLM-LSTM  21.9%\nLSTM with linear gain  22.2%\nSA-LSTM  19.3%\nLSTM with word vectors from word2vec Google News  20.5%\nSA-LSTM with unlabeled data from IMDB  18.6%\nSA-LSTM with unlabeled data from Amazon reviews  16.7%\nMV-RNN [29]  21.0%\nNBSVM-bi [36]  20.6%\nCNN-rand [13]  23.5%\nCNN-non-static (ConvNet with word vectors from word2vec Google News) [13]  18.5%\n\nThe results for this set of experiments are shown in Table 4. 
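The two regularizers used throughout these experiments, word dropout and input embedding dimension dropout, can be sketched as follows. The 1/(1-p) rescaling is the common "inverted dropout" convention, an assumption rather than a detail taken from the paper:

```python
import random

def word_dropout(tokens, p, unk="<unk>", rng=None):
    # Randomly replace a fraction p of the input words with an unknown
    # token (the experiments here drop up to 50% of words).
    rng = rng or random.Random(0)
    return [unk if rng.random() < p else t for t in tokens]

def embedding_dropout(vec, p, rng=None):
    # Zero out a random subset of embedding dimensions; surviving
    # dimensions are rescaled by 1/(1-p) so the expected activation
    # is unchanged ("inverted dropout").
    rng = rng or random.Random(0)
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in vec]
```

At test time both functions would simply be skipped, since inverted dropout folds the scaling into training.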
Our observation is that if we use the word vectors from word2vec, there is only a small gain of 0.5%. This is perhaps because the recurrent weights play an important role in our model and are not initialized properly in this experiment. However, if we use IMDB to pretrain the sequence autoencoders, the error decreases from 20.5% to 18.6%, nearly a 2% gain in accuracy; if we use Amazon reviews, a larger unlabeled dataset (7.9 million movie reviews), to pretrain the sequence autoencoders, the error goes down to 16.7%, which is another 2% gain in accuracy.\n\n3 The dataset is available at http://snap.stanford.edu/data/web-Amazon.html, which has 34 million general product reviews, but we only use 7.9 million movie reviews in our experiments.\n\nThis brings us to the question of how well this method of using unlabeled data fares compared to adding more labeled data. As argued by Socher et al. [30], one reason why these methods are not yet perfect is the lack of labeled training data; they therefore proposed using more labeled data by labeling an additional 215,154 phrases created by the Stanford Parser. The use of more labeled data allowed their method to achieve around 15% error on the test set, an improvement of approximately 5% over older methods with less labeled data.\n\nWe compare our method to their reported results [30] on sentence-level classification. As our method does not have access to these valuable labeled data, one might expect it to be severely disadvantaged and not perform on the same level. However, with unlabeled data and sequence autoencoders, we are able to obtain 16.7%, ranking second among many other methods that have access to a much larger corpus of labeled data. The fact that unlabeled data can compensate for the lack of labeled data is very significant, as unlabeled data are much cheaper than labeled data. The results are shown in Table 5.\n\nTable 5: More unlabeled data vs. more labeled data. 
Performance of SA-LSTM with additional unlabeled data and previous models with additional labeled data on the Rotten Tomatoes task.\n\nModel  Test error rate\nLSTM initialized with word2vec embeddings trained on Amazon reviews  21.7%\nSA-LSTM with unlabeled data from Amazon reviews  16.7%\nNB [30]  18.2%\nSVM [30]  20.6%\nBiNB [30]  16.9%\nVecAvg [30]  19.9%\nRNN [30]  17.6%\nMV-RNN [30]  17.1%\nRNTN [30]  14.6%\n\n4.3 Text classification experiments with 20 newsgroups\n\nThe experiments so far have been done on datasets where the number of tokens in a document is relatively small, a few hundred words. Our next question is whether it is possible to use SA-LSTMs for tasks that have a substantial number of words, such as web articles or emails, and where the content consists of many different topics.\n\nFor that purpose, we carry out the next experiments on the 20 newsgroups dataset [17].4 There are 11,293 documents in the training set and 7,528 in the test set. We use 15% of the training documents as a validation set. Each document is an email with an average length of 267 words and a maximum length of 11,925 words. Attachments, PGP keys, duplicates and empty messages are removed. As the newsgroup documents are long, it was previously considered improbable for recurrent networks to learn anything from the dataset. The best methods are often simple bag-of-words.\n\nWe repeat the same experiments with LSTMs and SA-LSTMs on this dataset. Similar to observations made in previous experiments, SA-LSTMs are generally more stable to train than LSTMs. To improve generalization of the models, we again use input embedding dropout and word dropout chosen on the validation set. With 70% input embedding dropout and 75% word dropout, SA-LSTM achieves 15.6% test set error, which is much better than previous classifiers on this dataset. 
Results are shown in Table 6.\n\n4.4 Character-level document classification experiments with DBpedia\n\nIn this set of experiments, we turn our attention to another challenging task: categorizing Wikipedia pages by reading character-by-character inputs. The dataset of interest is the DBpedia dataset [20], which was also used to benchmark convolutional neural nets in Zhang and LeCun [39].\n\n4 http://qwone.com/~jason/20Newsgroups/\n\nTable 6: Performance of models on the 20 newsgroups classification task.\n\nModel  Test error rate\nLSTM  18.0%\nLM-LSTM  15.3%\nLSTM with linear gain  71.6%\nSA-LSTM  15.6%\nHybrid Class RBM [18]  23.8%\nRBM-MLP [5]  20.5%\nSVM + Bag-of-words [3]  17.1%\nNaïve Bayes [3]  19.0%\n\nNote that unlike other datasets in Zhang and LeCun [39], DBpedia has no duplication or tainting issues, so we assume that their experimental results are valid on this dataset. DBpedia is a crowd-sourced effort to extract information from Wikipedia and categorize it into an ontology.\n\nFor this experiment, we follow the same procedure suggested in Zhang and LeCun [39]. The task is to classify DBpedia abstracts into one of 14 categories after reading the character-by-character input. The dataset is split into 560,000 training examples and 70,000 test examples. A DBpedia document has an average of 300 characters while the maximum length of all documents is 13,467 characters. As this dataset is large, overfitting is not an issue and thus we do not perform any dropout on the input or recurrent layers. 
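Character-by-character input of this kind is typically fed to the LSTM as a sequence of character ids (or one-hot vectors built from them). A minimal sketch, with an illustrative alphabet rather than the exact character set used for DBpedia:

```python
def char_encode(text, alphabet="abcdefghijklmnopqrstuvwxyz0123456789 .,!?'"):
    # Map each character to an integer id; 0 is reserved for characters
    # outside the alphabet. The alphabet here is an assumption for
    # illustration, not the one used in the experiments.
    ids = {c: i + 1 for i, c in enumerate(alphabet)}
    return [ids.get(c, 0) for c in text.lower()]
```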
For this dataset, we use a two-layer LSTM; each layer has 512 hidden units, and the input embedding has 128 units.\n\nTable 7: Performance of models on the DBpedia character-level classification task.\n\nModel  Test error rate\nLSTM  13.64%\nLM-LSTM  1.50%\nLSTM with linear gain  1.32%\nSA-LSTM  2.34%\nSA-LSTM with linear gain  1.23%\nSA-LSTM with 3 layers and linear gain  1.19%\nSA-LSTM (word-level)  1.40%\nBag-of-words  3.57%\nSmall ConvNet  1.98%\nLarge ConvNet  1.73%\n\nOn this dataset, we find that the linear label gain described in Section 3 is an effective mechanism to inject gradients into earlier steps in LSTMs. This linear gain method works well and achieves 1.32% test set error, which is better than SA-LSTM. Combining SA-LSTM and the linear gain method achieves 1.19% test set error, a significant improvement over the results of the convolutional networks, as shown in Table 7.\n\n4.5 Object classification experiments with CIFAR-10\n\nIn these experiments, we attempt to see if our pretraining methods extend to non-textual data. To do this, we train an LSTM to read the CIFAR-10 image dataset row by row (where the input at each timestep is an entire row of pixels) and output the class of the image at the end. We use the same method as in [16] to perform data augmentation. We also trained an LSTM to do next-row prediction given the current row (we denote this LM-LSTM) and an LSTM to predict the image by rows after reading all its rows (SA-LSTM). We then fine-tune these on the classification task. We present the results in Table 8. 
While we do not achieve the results attained by state-of-the-art convolutional networks, our 2-layer pretrained LM-LSTM is able to exceed the results of the baseline convolutional DBN model [15] despite not using any convolutions, and outperforms the non-pretrained LSTM.\n\nTable 8: Performance of models on the CIFAR-10 object classification task.\n\nModel  Test error rate\n1-layer LSTM  25.0%\n1-layer LM-LSTM  23.1%\n1-layer SA-LSTM  25.1%\n2-layer LSTM  26.0%\n2-layer LM-LSTM  18.7%\n2-layer SA-LSTM  26.0%\nConvolutional DBNs [15]  21.1%\n\n5 Discussion\n\nIn this paper, we found that it is possible to use LSTM recurrent networks for NLP tasks such as document classification. We also found that a language model or a sequence autoencoder can help stabilize the learning in recurrent networks. On the five benchmarks we tried, LSTMs can become a general classifier that reaches or surpasses the performance levels of all previous baselines.\n\nAcknowledgements: We thank Oriol Vinyals, Ilya Sutskever, Greg Corrado, Vijay Vasudevan, Manjunath Kudlur, Rajat Monga, Matthieu Devin, and the Google Brain team for their help.\n\nReferences\n\n[1] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res., 6:1817\u20131853, December 2005.\n\n[2] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. JMLR, 2003.\n\n[3] A. Cardoso-Cachopo. Datasets for single-label text categorization. http://web.ist.utl.pt/acardoso/datasets/, 2015. [Online; accessed 25-May-2015].\n\n[4] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals. Listen, attend and spell. arXiv preprint arXiv:1508.01211, 2015.\n\n[5] Y. Dauphin and Y. Bengio. Stochastic ratio matching of RBMs for sparse high-dimensional inputs. In NIPS, 2013.\n\n[6] F. A. Gers, J. Schmidhuber, and F. Cummins. 
Learning to forget: Continual prediction with LSTM. Neural Computation, 2000.\n\n[7] A. Graves. Generating sequences with recurrent neural networks. arXiv preprint, 2013.\n\n[8] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. In ICML, 2015.\n\n[9] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. A Field Guide to Dynamical Recurrent Neural Networks, 2001.\n\n[10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.\n\n[11] S. Jean, K. Cho, R. Memisevic, and Y. Bengio. On using very large target vocabulary for neural machine translation. In ICML, 2014.\n\n[12] R. Johnson and T. Zhang. Effective use of word order for text categorization with convolutional neural networks. In NAACL, 2014.\n\n[13] Y. Kim. Convolutional neural networks for sentence classification. In EMNLP, 2014.\n\n[14] R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, and S. Fidler. Skip-thought vectors. In NIPS, 2015.\n\n[15] A. Krizhevsky. Convolutional deep belief networks on CIFAR-10. Technical report, University of Toronto, 2010.\n\n[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.\n\n[17] K. Lang. Newsweeder: Learning to filter netnews. In ICML, 1995.\n\n[18] H. Larochelle, M. Mandel, R. Pascanu, and Y. Bengio. Learning algorithms for the classification restricted Boltzmann machine. JMLR, 2012.\n\n[19] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In ICML, 2014.\n\n[20] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, et al. DBpedia \u2013 a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 2014.\n\n[21] T. 
Luong, I. Sutskever, Q. V. Le, O. Vinyals, and W. Zaremba. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, 2014.\n\n[22] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In ACL, 2011.\n\n[23] J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In RecSys, pages 165\u2013172. ACM, 2013.\n\n[24] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH, 2010.\n\n[25] J. Y. H. Ng, M. J. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.\n\n[26] B. Pang and L. Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, 2005.\n\n[27] D. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 1986.\n\n[28] L. Shang, Z. Lu, and H. Li. Neural responding machine for short-text conversation. In EMNLP, 2015.\n\n[29] R. Socher, B. Huval, C. D. Manning, and A. Y. Ng. Semantic compositionality through recursive matrix-vector spaces. In EMNLP, 2012.\n\n[30] R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013.\n\n[31] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.\n\n[32] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.\n\n[33] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. Grammar as a foreign language. In NIPS, 2015.\n\n[34] O. Vinyals and Q. V. Le. A neural conversational model. 
In ICML Deep Learning Workshop, 2015.\n\n[35] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2014.\n\n[36] S. I. Wang and C. D. Manning. Baselines and bigrams: Simple, good sentiment and topic classification. In ACL, 2012.\n\n[37] P. J. Werbos. Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard, 1974.\n\n[38] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.\n\n[39] X. Zhang and Y. LeCun. Character-level convolutional networks for text classification. In NIPS, 2015.", "award": [], "sourceid": 1728, "authors": [{"given_name": "Andrew", "family_name": "Dai", "institution": "Google Inc"}, {"given_name": "Quoc", "family_name": "Le", "institution": "Google"}]}