{"title": "Grammar as a Foreign Language", "book": "Advances in Neural Information Processing Systems", "page_first": 2773, "page_last": 2781, "abstract": "Syntactic constituency parsing is a fundamental problem in naturallanguage processing which has been the subject of intensive researchand engineering for decades.  As a result, the most accurate parsersare domain specific, complex, and inefficient.  In this paper we showthat the domain agnostic attention-enhanced sequence-to-sequence modelachieves state-of-the-art results on the most widely used syntacticconstituency parsing dataset, when trained on a large synthetic corpusthat was annotated using existing parsers.  It also matches theperformance of standard parsers when trained on a smallhuman-annotated dataset, which shows that this model is highlydata-efficient, in contrast to sequence-to-sequence models without theattention mechanism.  Our parser is also fast, processing over ahundred sentences per second with an unoptimized CPU implementation.", "full_text": "Grammar as a Foreign Language\n\nOriol Vinyals\u2217\n\nGoogle\n\nvinyals@google.com\n\nLukasz Kaiser\u2217\n\nGoogle\n\nlukaszkaiser@google.com\n\nTerry Koo\n\nGoogle\n\nterrykoo@google.com\n\nSlav Petrov\n\nGoogle\n\nslav@google.com\n\nIlya Sutskever\n\nGoogle\n\nilyasu@google.com\n\nGeoffrey Hinton\n\nGoogle\n\ngeoffhinton@google.com\n\nAbstract\n\nSyntactic constituency parsing is a fundamental problem in natural language pro-\ncessing and has been the subject of intensive research and engineering for decades.\nAs a result, the most accurate parsers are domain speci\ufb01c, complex, and in-\nef\ufb01cient.\nIn this paper we show that the domain agnostic attention-enhanced\nsequence-to-sequence model achieves state-of-the-art results on the most widely\nused syntactic constituency parsing dataset, when trained on a large synthetic cor-\npus that was annotated using existing parsers. It also matches the performance\nof standard parsers when trained only on a small human-annotated dataset, which\nshows that this model is highly data-ef\ufb01cient, in contrast to sequence-to-sequence\nmodels without the attention mechanism. Our parser is also fast, processing over\na hundred sentences per second with an unoptimized CPU implementation.\n\n1\n\nIntroduction\n\nSyntactic constituency parsing is a fundamental problem in linguistics and natural language pro-\ncessing that has a wide range of applications. This problem has been the subject of intense research\nfor decades, and as a result, there exist highly accurate domain-speci\ufb01c parsers. The computational\nrequirements of traditional parsers are cubic in sentence length, and while linear-time shift-reduce\nconstituency parsers improved in accuracy in recent years, they never matched state-of-the-art. Fur-\nthermore, standard parsers have been designed with parsing in mind; the concept of a parse tree is\ndeeply ingrained into these systems, which makes these methods inapplicable to other problems.\nRecently, Sutskever et al. [1] introduced a neural network model for solving the general sequence-\nto-sequence problem, and Bahdanau et al. [2] proposed a related model with an attention mechanism\nthat makes it capable of handling long sequences well. Both models achieve state-of-the-art results\non large scale machine translation tasks (e.g., [3, 4]). Syntactic constituency parsing can be formu-\nlated as a sequence-to-sequence problem if we linearize the parse tree (cf. 
Our early experiments focused on the sequence-to-sequence model of Sutskever et al. [1]. We found this model to work poorly when we trained it on standard human-annotated parsing datasets (1M tokens), so we constructed an artificial dataset by labelling a large corpus with the BerkeleyParser.

[Figure 1 sketch: three stacked input LSTM layers (LSTM1–3 in) consume the reversed sentence “. Go” followed by the already generated output symbols, and three stacked output LSTM layers (LSTM1–3 out) emit the linearized parse “(S (VP XX )VP . )S END”.]
Figure 1: A schematic outline of a run of our LSTM+A model on the sentence “Go.”. See text for details.

To our surprise, the sequence-to-sequence model matched the BerkeleyParser that produced the annotation, having achieved an F1 score of 90.5 on the test set (section 23 of the WSJ).

We suspected that the attention model of Bahdanau et al. [2] might be more data efficient and we found that it is indeed the case. We trained a sequence-to-sequence model with attention on the small human-annotated parsing dataset and were able to achieve an F1 score of 88.3 on section 23 of the WSJ without the use of an ensemble and 90.5 with an ensemble, which matches the performance of the BerkeleyParser (90.4) when trained on the same data.

Finally, we constructed a second artificial dataset consisting of only high-confidence parse trees, as measured by the agreement of two parsers. We trained a sequence-to-sequence model with attention on this data and achieved an F1 score of 92.1 on section 23 of the WSJ, matching the state-of-the-art. This result did not require an ensemble, and as a result, the parser is also very fast.

2 LSTM+A Parsing Model

Let us first recall the sequence-to-sequence LSTM model. The Long Short-Term Memory model of [5] is defined as follows. Let x_t, h_t, and m_t be the input, control state, and memory state at timestep t. Given a sequence of inputs (x_1, ..., x_T), the LSTM computes the h-sequence (h_1, ..., h_T) and the m-sequence (m_1, ..., m_T) as follows:

    i_t  = sigm(W_1 x_t + W_2 h_{t-1})
    i'_t = tanh(W_3 x_t + W_4 h_{t-1})
    f_t  = sigm(W_5 x_t + W_6 h_{t-1})
    o_t  = sigm(W_7 x_t + W_8 h_{t-1})
    m_t  = m_{t-1} ⊙ f_t + i_t ⊙ i'_t
    h_t  = m_t ⊙ o_t

The operator ⊙ denotes element-wise multiplication, the matrices W_1, ..., W_8 and the vector h_0 are the parameters of the model, and all the nonlinearities are computed element-wise.

In a deep LSTM, each subsequent layer uses the h-sequence of the previous layer for its input sequence x. The deep LSTM defines a distribution over output sequences given an input sequence:

    P(B | A) = ∏_{t=1}^{T_B} P(B_t | A_1, ..., A_{T_A}, B_1, ..., B_{t-1})
             ≡ ∏_{t=1}^{T_B} softmax(W_o · h_{T_A + t})^⊤ δ_{B_t}.

The above equation assumes a deep LSTM whose input sequence is x = (A_1, ..., A_{T_A}, B_1, ..., B_{T_B}), so h_t denotes the t-th element of the h-sequence of the topmost LSTM. The matrix W_o consists of the vector representations of each output symbol and the symbol δ_b is a Kronecker delta with a dimension for each output symbol, so softmax(W_o · h_{T_A + t})^⊤ δ_{B_t} is precisely the B_t'th element of the distribution defined by the softmax. Every output sequence terminates with a special end-of-sequence token which is necessary in order to define a distribution over sequences of variable lengths. We use two different sets of LSTM parameters, one for the input sequence and one for the output sequence, as shown in Figure 1. Stochastic gradient descent is used to maximize the training objective, which is the average over the training set of the log probability of the correct output sequence given the input sequence.

2.1 Attention Mechanism

An important extension of the sequence-to-sequence model is the addition of an attention mechanism. We adapted the attention model from [2] which, to produce each output symbol B_t, uses an attention mechanism over the encoder LSTM states. Similar to our sequence-to-sequence model described in the previous section, we use two separate LSTMs (one to encode the sequence of input words A_i, and another one to produce or decode the output symbols B_i). Recall that the encoder hidden states are denoted (h_1, ..., h_{T_A}) and we denote the hidden states of the decoder by (d_1, ..., d_{T_B}) := (h_{T_A+1}, ..., h_{T_A+T_B}).

To compute the attention vector at each output time t over the input words (1, ..., T_A) we define:

    u_i^t = v^⊤ tanh(W'_1 h_i + W'_2 d_t)
    a_i^t = softmax(u_i^t)
    d'_t  = Σ_{i=1}^{T_A} a_i^t h_i

The vector v and matrices W'_1, W'_2 are learnable parameters of the model. The vector u^t has length T_A and its i-th item contains a score of how much attention should be put on the i-th hidden encoder state h_i. These scores are normalized by softmax to create the attention mask a^t over encoder hidden states. In all our experiments, we use the same hidden dimensionality (256) at the encoder and the decoder, so v is a vector and W'_1 and W'_2 are square matrices. Lastly, we concatenate d'_t with d_t, which becomes the new hidden state from which we make predictions, and which is fed to the next time step in our recurrent model.

In Section 4 we provide an analysis of what the attention mechanism learned, and we visualize the normalized attention vector a^t for all t in Figure 4.
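To make the two definitions above concrete, the following is a minimal NumPy sketch of a single LSTM update and of the attention computation. The dimensions, the random initialization, and the helper names (lstm_step, attention) are illustrative assumptions for this example rather than the authors' implementation, and biases are omitted exactly as in the equations above.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, m_prev, W):
    """One LSTM update following the six equations above; W holds W_1..W_8."""
    i_t  = sigm(W[1] @ x_t + W[2] @ h_prev)      # input gate
    ip_t = np.tanh(W[3] @ x_t + W[4] @ h_prev)   # candidate input i'_t
    f_t  = sigm(W[5] @ x_t + W[6] @ h_prev)      # forget gate
    o_t  = sigm(W[7] @ x_t + W[8] @ h_prev)      # output gate
    m_t  = m_prev * f_t + i_t * ip_t             # memory state
    h_t  = m_t * o_t                             # control state
    return h_t, m_t

def attention(H_enc, d_t, v, W1p, W2p):
    """Attention mask a^t over encoder states h_i and the context vector d'_t."""
    u = np.array([v @ np.tanh(W1p @ h_i + W2p @ d_t) for h_i in H_enc])
    a = np.exp(u - u.max())
    a /= a.sum()                                 # softmax over input positions
    d_prime = (a[:, None] * H_enc).sum(axis=0)   # weighted sum of encoder states
    return a, d_prime

# Toy run with illustrative sizes (the paper uses 256 units and 3 layers).
dim = 4
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(dim, dim)) for k in range(1, 9)}
h = m = np.zeros(dim)
H_enc = []
for _ in range(3):                               # encode a 3-token input
    h, m = lstm_step(rng.normal(size=dim), h, m, W)
    H_enc.append(h)
a, d_prime = attention(np.stack(H_enc), h,
                       rng.normal(size=dim),
                       rng.normal(scale=0.1, size=(dim, dim)),
                       rng.normal(scale=0.1, size=(dim, dim)))
```

In the full model, the concatenation of d'_t and d_t would feed the output softmax and the next decoder time step.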
2.2 Linearizing Parsing Trees

To apply the model described above to parsing, we need to design an invertible way of converting the parse tree into a sequence (linearization). We do this in a very simple way following a depth-first traversal order, as depicted in Figure 2.

[Figure 2 sketch: the sentence “John has a dog .” is shown first with its parse tree (S dominating NP (NNP), VP (VBZ, NP (DT, NN)) and the final period), and then with its linearization “(S (NP NNP )NP (VP VBZ (NP DT NN )NP )VP . )S”.]
Figure 2: Example parsing task and its linearization.

We use the above model for parsing in the following way. First, the network consumes the sentence in a left-to-right sweep, creating vectors in memory. Then, it outputs the linearized parse tree using information in these vectors. As described below, we use 3 LSTM layers, reverse the input sentence and normalize part-of-speech tags. An example run of our LSTM+A model on the sentence “Go.” is depicted in Figure 1 (top gray edges illustrate attention).
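To illustrate this linearization, here is a small sketch that produces the depth-first sequence of Figure 2 from a tree and inverts it again; the tuple-based tree encoding and the function names are assumptions made only for this example.

```python
# A node is (label, children...) for internal nodes and a bare tag string for leaves.
TREE = ("S", ("NP", "NNP"), ("VP", "VBZ", ("NP", "DT", "NN")), ".")

def linearize(node):
    """Depth-first linearization: (X ... )X around internal nodes, tags for leaves."""
    if isinstance(node, str):
        return [node]
    label, children = node[0], node[1:]
    out = ["(" + label]
    for child in children:
        out.extend(linearize(child))
    out.append(")" + label)
    return out

def delinearize(tokens):
    """Invert linearize(); assumes a well-formed (balanced) token sequence."""
    stack = [("ROOT", [])]
    for tok in tokens:
        if tok.startswith("("):
            stack.append((tok[1:], []))
        elif tok.startswith(")"):
            label, children = stack.pop()
            stack[-1][1].append(tuple([label] + children))
        else:
            stack[-1][1].append(tok)
    return stack[0][1][0]

seq = linearize(TREE)
print(" ".join(seq))        # (S (NP NNP )NP (VP VBZ (NP DT NN )NP )VP . )S
assert delinearize(seq) == TREE
```

In the actual training data the part-of-speech tags are further replaced by “XX”, as described in Section 2.3 below.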
2.3 Parameters and Initialization

Sizes. In our experiments we used a model with 3 LSTM layers and 256 units in each layer, which we call LSTM+A. Our input vocabulary size was 90K and we output 128 symbols.

Dropout. When training on a small dataset we additionally used 2 dropout layers, one between LSTM1 and LSTM2, and one between LSTM2 and LSTM3. We call this model LSTM+A+D.

POS-tag normalization. Since part-of-speech (POS) tags are not evaluated in the syntactic parsing F1 score, we replaced all of them by “XX” in the training data. This improved our F1 score by about 1 point, which is surprising: for standard parsers, including POS tags in the training data helps significantly. All experiments reported below are performed with normalized POS tags.

Input reversing. We also found it useful to reverse the input sentences but not their parse trees, similarly to [1]. Not reversing the input had a small negative impact on the F1 score on our development set (about 0.2 absolute). All experiments reported below are performed with input reversing.

Pre-training word vectors. The embedding layer for our 90K vocabulary can be initialized randomly or using pre-trained word-vector embeddings. We pre-trained skip-gram embeddings of size 512 using word2vec [6] on a 10B-word corpus. These embeddings were used to initialize our network but were not fixed; they were later modified during training. We discuss the impact of pre-training in the experimental section.

We do not apply any other special preprocessing to the data. In particular, we do not binarize the parse trees or handle unaries in any specific way. We also treat unknown words in a naive way: we map all words beyond our 90K vocabulary to a single UNK token. This potentially underestimates our final results, but keeps our framework task-independent.
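These data transformations are simple enough to sketch directly. The toy vocabulary, the tag set, and the helper names below are illustrative assumptions, not the pipeline actually used in the experiments.

```python
UNK = "UNK"
VOCAB = {"john", "has", "a", "dog", "."}            # stand-in for the 90K-word vocabulary

def preprocess_input(words, vocab=VOCAB):
    """Map out-of-vocabulary words to UNK, then reverse the input sentence."""
    mapped = [w if w.lower() in vocab else UNK for w in words]
    return mapped[::-1]                              # the target tree is NOT reversed

def normalize_pos(linearized):
    """Replace POS tags by "XX" in the linearized target; brackets stay untouched."""
    pos_tags = {"NNP", "VBZ", "DT", "NN"}            # illustrative tag set; "." is kept, as in Figure 1
    return ["XX" if tok in pos_tags else tok for tok in linearized]

print(preprocess_input(["Zorblax", "has", "a", "dog", "."]))
# ['.', 'dog', 'a', 'has', 'UNK']
print(normalize_pos("(S (NP NNP )NP (VP VBZ (NP DT NN )NP )VP . )S".split()))
# ['(S', '(NP', 'XX', ')NP', '(VP', 'XX', '(NP', 'XX', 'XX', ')NP', ')VP', '.', ')S']
```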
3 Experiments

3.1 Training Data

We trained the model described above on 2 different datasets. For one, we trained on the standard WSJ training dataset. This is a very small training set by neural network standards, as it contains only 40K sentences (compared to 60K examples even in MNIST). Still, even training on this set, we managed to get results that match those obtained by domain-specific parsers.

To match state-of-the-art, we created another, larger training set of ~11M parsed sentences (250M tokens). First, we collected all publicly available treebanks. We used the OntoNotes corpus version 5 [7], the English Web Treebank [8] and the updated and corrected Question Treebank [9].¹ Note that the popular Wall Street Journal section of the Penn Treebank [10] is part of the OntoNotes corpus. In total, these corpora give us ~90K training sentences (we held out certain sections for evaluation, as described below).

In addition to this gold standard data, we use a corpus parsed with existing parsers using the “tri-training” approach of [11]. In this approach, two parsers, our reimplementation of BerkeleyParser [12] and a reimplementation of ZPar [13], are used to process unlabeled sentences sampled from news appearing on the web. We select only sentences for which both parsers produced the same parse tree and re-sample to match the distribution of sentence lengths of the WSJ training corpus. Re-sampling is useful because parsers agree much more often on short sentences. We call the set of ~11 million sentences selected in this way, together with the ~90K golden sentences described above, the high-confidence corpus.

After creating this corpus, we made sure that no sentence from the development or test set appears in the corpus, also after replacing rare words with “unknown” tokens. This operation guarantees that we never see any test sentence during training, but it also lowers our F1 score by about 0.5 points. We are not sure if such strict de-duplication was performed in previous works, but even with this, we still match state-of-the-art.

¹All treebanks are available through the Linguistic Data Consortium (LDC): OntoNotes (LDC2013T19), English Web Treebank (LDC2012T13) and Question Treebank (LDC2012R121).

In earlier experiments, we only used one parser, our reimplementation of BerkeleyParser, to create a corpus of parsed sentences. In that case we just parsed ~7 million sentences from news appearing on the web and combined these parsed sentences with the ~90K golden corpus described above. We call this the BerkeleyParser corpus.

Parser                               Training Set              WSJ 22   WSJ 23
baseline LSTM+D                      WSJ only                  < 70     < 70
LSTM+A+D                             WSJ only                  88.7     88.3
LSTM+A+D ensemble                    WSJ only                  90.7     90.5
baseline LSTM                        BerkeleyParser corpus     91.0     90.5
LSTM+A                               high-confidence corpus    92.8     92.1
Petrov et al. (2006) [12]            WSJ only                  91.1     90.4
Zhu et al. (2013) [13]               WSJ only                  N/A      90.4
Petrov et al. (2010) ensemble [14]   WSJ only                  92.5     91.8
Zhu et al. (2013) [13]               semi-supervised           N/A      91.3
Huang & Harper (2009) [15]           semi-supervised           N/A      91.3
McClosky et al. (2006) [16]          semi-supervised           92.4     92.1

Table 1: F1 scores of various parsers on the development and test set. See text for discussion.

3.2 Evaluation

We use the standard EVALB tool² for evaluation and report F1 scores on our development set (section 22 of the Penn Treebank) and the final test set (section 23) in Table 1.

²http://nlp.cs.nyu.edu/evalb/

First, let us remark that our training setup differs from those reported in previous works. To the best of our knowledge, no standard parsers have ever been trained on datasets numbering in the hundreds of millions of tokens, and it would be hard to do due to efficiency problems. We therefore cite the semi-supervised results, which are analogous in spirit but use less data.

Table 1 shows the performance of our models at the top and results from other papers at the bottom. We compare to variants of the BerkeleyParser that use self-training on unlabeled data [15], or that built an ensemble of multiple parsers [14], or that combine both techniques. We also include the best linear-time parser in the literature, the transition-based parser of [13].

It can be seen that, when training on WSJ only, a baseline LSTM does not achieve any reasonable score, even with dropout and early stopping. But a single attention model gets to 88.3 and an ensemble of 5 LSTM+A+D models achieves 90.5, matching a single-model BerkeleyParser on WSJ 23. When trained on the large high-confidence corpus, a single LSTM+A model achieves 92.1 and so matches the best previous single model result.

Generating well-formed trees. The LSTM+A model trained only on the WSJ dataset produced malformed trees for 25 of the 1700 sentences in our development set (1.5% of all cases), and the model trained on the full high-confidence dataset did this for 14 sentences (0.8%). In these few cases where LSTM+A outputs a malformed tree, we simply add brackets to either the beginning or the end of the tree in order to make it balanced. It is worth noting that all 14 cases where LSTM+A produced unbalanced trees were sentences or sentence fragments that did not end with proper punctuation. There were very few such sentences in the training data, so it is not a surprise that our model cannot deal with them very well.
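The repair just described is simple to sketch. The function below pads an unbalanced linearized output with a matching opening bracket at the beginning or matching closing brackets at the end; the exact labels chosen and the helper name are best-guess assumptions for illustration, since the text above only states that brackets are added at one end or the other.

```python
def balance(tokens):
    """Pad a malformed linearized tree so its brackets match (illustrative repair)."""
    prefix, stack = [], []
    for tok in tokens:
        if tok.startswith("("):
            stack.append(tok[1:])
        elif tok.startswith(")"):
            if stack:
                stack.pop()
            else:
                prefix.append("(" + tok[1:])             # opener missing at the beginning
    suffix = [")" + label for label in reversed(stack)]  # closers missing at the end
    return prefix + list(tokens) + suffix

print(" ".join(balance("(S (VP XX )VP .".split())))
# (S (VP XX )VP . )S
```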
Score by sentence length. An important concern with the sequence-to-sequence LSTM was that it may not be able to handle long sentences well. We determine the extent of this problem by partitioning the development set by length, and evaluating BerkeleyParser, a baseline LSTM model without attention, and LSTM+A on sentences of each length. The results, presented in Figure 3, are surprising. The difference between the F1 score on sentences of length up to 30 and that on sentences of length up to 70 is 1.3 for the BerkeleyParser, 1.7 for the baseline LSTM, and 0.7 for LSTM+A. So the baseline LSTM already has performance similar to the BerkeleyParser, and it degrades with length only slightly. Surprisingly, LSTM+A shows less degradation with length than BerkeleyParser – a full O(n³) chart parser that uses a lot more memory.

[Figure 3 plot: F1 score (roughly 90–96) against sentence length (10–70) on WSJ 22 for BerkeleyParser, baseline LSTM, and LSTM+A.]
Figure 3: Effect of sentence length on the F1 score on WSJ 22.

Beam size influence. Our decoder uses a beam of a fixed size to calculate the output sequence of labels. We experimented with different settings for the beam size. It turns out that it is almost irrelevant. We report results that use beam size 10, but using beam size 2 only lowers the F1 score of LSTM+A on the development set by 0.2, and using beam size 1 lowers it by 0.5. Beam sizes above 10 do not give any additional improvements.

Dropout influence. We only used dropout when training on the small WSJ dataset and its influence was significant. A single LSTM+A model only achieved an F1 score of 86.5 on our development set, which is over 2 points lower than the 88.7 of a LSTM+A+D model.

Pre-training influence. As described in the previous section, we initialized the word-vector embedding with pre-trained word vectors obtained from word2vec. To test the influence of this initialization, we trained a LSTM+A model on the high-confidence corpus, and a LSTM+A+D model on the WSJ corpus, starting with randomly initialized word-vector embeddings. The F1 score on our development set was 0.4 lower for the LSTM+A model and 0.3 lower for the LSTM+A+D model (88.4 vs 88.7). So the effect of pre-training is consistent but small.

Performance on other datasets. The WSJ evaluation set has been in use for 20 years and is commonly used to compare syntactic parsers. But it is not representative of text encountered on the web [8]. Even though our model was trained on a news corpus, we wanted to check how well it generalizes to other forms of text. To this end, we evaluated it on two additional datasets:

QTB: 1000 held-out sentences from the Question Treebank [9];
WEB: the first half of each domain from the English Web Treebank [8] (8310 sentences).

LSTM+A trained on the high-confidence corpus (which only includes text from news) achieved an F1 score of 95.7 on QTB and 84.6 on WEB.
Our score on WEB is higher than both the best score reported in [8] (83.5) and the best score we achieved with an in-house reimplementation of BerkeleyParser trained on human-annotated data (84.4). We managed to achieve a slightly higher score (84.8) with the in-house BerkeleyParser trained on a large corpus. On QTB, the 95.7 score of LSTM+A is still lower than the best score of our in-house BerkeleyParser (96.2). Still, taking into account that there were only a few questions in the training data, these scores show that LSTM+A managed to generalize well beyond the news language it was trained on.

Parsing speed. Our LSTM+A model, running on a multi-core CPU using batches of 128 sentences on a generic unoptimized decoder, can parse over 120 sentences from WSJ per second for sentences of all lengths (using beam-size 1). This is better than the speed reported for this batch size in Figure 4 of [17] at 100 sentences per second, even though they run on a GPU and only on sentences of under 40 words. Note that they achieve an F1 score of 89.7 on this subset of sentences of section 22, while our model at beam-size 1 achieves a score of 93.2 on this subset.

Figure 4: Attention matrix. Shown on top is the attention matrix where each column is the attention vector over the inputs. On the bottom, we show outputs for four consecutive time steps, where the attention mask moves to the right. As can be seen, every time a terminal node is consumed, the attention pointer moves to the right.

4 Analysis

As shown in this paper, the attention mechanism was a key component, especially when learning from a relatively small dataset. We found that the model did not overfit and learned the parsing function from scratch much faster, which resulted in a model that generalized much better than the plain LSTM without attention.

One of the most interesting aspects of attention is that it allows us to visualize and interpret what the model has learned from the data. For example, in [2] it is shown that for translation, attention learns an alignment function, which certainly should help translating from English to French.

Figure 4 shows an example of the attention model trained only on the WSJ dataset. From the attention matrix, where each column is the attention vector over the inputs, it is clear that the model focuses quite sharply on one word as it produces the parse tree. It is also clear that the focus moves from the first word to the last monotonically, and steps to the right deterministically when a word is consumed.

On the bottom of Figure 4 we see where the model attends (black arrow), and the current output being decoded in the tree (black circle). This stack procedure is learned from data (as all the parameters are randomly initialized), but it is not quite a simple stack decoding. Indeed, at the input side, if the model focuses on position i, that state has information for all words after i (since we also reverse the inputs). It is worth noting that, in some examples (not shown here), the model does skip words.

5 Related Work

The task of syntactic constituency parsing has received a tremendous amount of attention in the last 20 years. Traditional approaches to constituency parsing rely on probabilistic context-free grammars (CFGs). The focus in these approaches is on devising appropriate smoothing techniques for highly lexicalized and thus rare events [18] or carefully crafting the model structure [19].
[12] partially alleviate the heavy reliance on manual modeling of linguistic structure by using latent variables to learn a more articulated model. However, their model still depends on a CFG backbone and is thereby potentially restricted in its capacity.

Early neural network approaches to parsing, for example by [20, 21], also relied on strong linguistic insights. [22] introduced Incremental Sigmoid Belief Networks for syntactic parsing. By constructing the model structure incrementally, they are able to avoid making strong independence assumptions, but inference becomes intractable. To avoid complex inference methods, [23] propose a recurrent neural network where parse trees are decomposed into a stack of independent levels. Unfortunately, this decomposition breaks for long sentences and their accuracy on longer sentences falls quite significantly behind the state-of-the-art. [24] used a tree-structured neural network to score candidate parse trees. Their model however relies again on the CFG assumption and furthermore can only be used to score candidate trees rather than for full inference.

Our LSTM model significantly differs from all these models, as it makes no assumptions about the task. As a sequence-to-sequence prediction model it is somewhat related to the incremental parsing models, pioneered by [25] and extended by [26]. Such linear time parsers however typically need some task-specific constraints and might build up the parse in multiple passes. Relatedly, [13] present excellent parsing results with a single left-to-right pass, but require a stack to explicitly delay making decisions and a parsing-specific transition strategy in order to achieve good parsing accuracies. The LSTM in contrast uses its short term memory to model the complex underlying structure that connects the input-output pairs.

Recently, researchers have developed a number of neural network models that can be applied to general sequence-to-sequence problems. [27] was the first to propose a differentiable attention mechanism for the general problem of handwritten text synthesis, although his approach assumed a monotonic alignment between the input and output sequences. Later, [2] introduced a more general attention model that does not assume a monotonic alignment, and applied it to machine translation, and [28] applied the same model to speech recognition. [29] used a convolutional neural network to encode a variable-sized input sentence into a vector of a fixed dimension and used an RNN to produce the output sentence. Essentially the same model has been used by [30] to successfully learn to generate image captions. Finally, already in 1990 [31] experimented with applying recurrent neural networks to the problem of syntactic parsing.

6 Conclusions

In this work, we have shown that generic sequence-to-sequence approaches can achieve excellent results on syntactic constituency parsing with relatively little effort or tuning. In addition, while we found the model of Sutskever et al. [1] to not be particularly data efficient, the attention model of Bahdanau et al. [2] was found to be highly data efficient, as it has matched the performance of the BerkeleyParser when trained on a small human-annotated parsing dataset. Finally, we showed that synthetic datasets with imperfect labels can be highly useful, as our models have substantially outperformed the models that have been used to create their training data.
We suspect it is the case due to the different natures of the teacher model and the student model: the student model has likely viewed the teacher's errors as noise which it has been able to ignore. This approach was so successful that we obtained a new state-of-the-art result in syntactic constituency parsing with a single attention model, which also means that the model is exceedingly fast. This work shows that domain independent models with excellent learning algorithms can match and even outperform domain specific models.

Acknowledgement. We would like to thank Amin Ahmad, Dan Bikel and Jonni Kanerva.

References

[1] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[3] Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, 2014.
[4] Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007, 2014.
[5] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[6] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[7] Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. OntoNotes: The 90% solution. In NAACL. ACL, June 2006.
[8] Slav Petrov and Ryan McDonald. Overview of the 2012 shared task on parsing the web. Notes of the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL), 2012.
[9] John Judge, Aoife Cahill, and Josef van Genabith. QuestionBank: Creating a corpus of parse-annotated questions. In Proceedings of ICCL & ACL'06, pages 497–504. ACL, July 2006.
[10] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[11] Zhenghua Li, Min Zhang, and Wenliang Chen. Ambiguity-aware ensemble training for semi-supervised dependency parsing. In Proceedings of ACL'14, pages 457–467. ACL, 2014.
[12] Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. Learning accurate, compact, and interpretable tree annotation. In ACL. ACL, July 2006.
[13] Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, and Jingbo Zhu. Fast and accurate shift-reduce constituent parsing. In ACL. ACL, August 2013.
[14] Slav Petrov. Products of random latent variable grammars. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, pages 19–27. ACL, June 2010.
[15] Zhongqiang Huang and Mary Harper. Self-training PCFG grammars with latent annotations across languages. In EMNLP. ACL, August 2009.
[16] David McClosky, Eugene Charniak, and Mark Johnson. Effective self-training for parsing. In NAACL. ACL, June 2006.
[17] David Hall, Taylor Berg-Kirkpatrick, John Canny, and Dan Klein. Sparser, better, faster GPU parsing. In ACL, 2014.
[18] Michael Collins. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16–23. ACL, July 1997.
[19] Dan Klein and Christopher D. Manning. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the ACL, pages 423–430. ACL, July 2003.
[20] James Henderson. Inducing history representations for broad coverage statistical parsing. In NAACL, May 2003.
[21] James Henderson. Discriminative training of a neural network statistical parser. In Proceedings of the 42nd Meeting of the ACL (ACL'04), Main Volume, pages 95–102, July 2004.
[22] Ivan Titov and James Henderson. Constituent parsing with incremental sigmoid belief networks. In ACL. ACL, June 2007.
[23] Ronan Collobert. Deep learning for efficient discriminative parsing. In International Conference on Artificial Intelligence and Statistics, 2011.
[24] Richard Socher, Cliff C. Lin, Chris Manning, and Andrew Y. Ng. Parsing natural scenes and natural language with recursive neural networks. In ICML, 2011.
[25] Adwait Ratnaparkhi. A linear observed time statistical parser based on maximum entropy models. In Second Conference on Empirical Methods in Natural Language Processing, 1997.
[26] Michael Collins and Brian Roark. Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Meeting of the ACL (ACL'04), Main Volume, pages 111–118, July 2004.
[27] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[28] Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. End-to-end continuous speech recognition using attention-based recurrent NN: First results. arXiv preprint arXiv:1412.1602, 2014.
[29] Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In EMNLP, pages 1700–1709, 2013.
[30] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555, 2014.
[31] Zoubin Ghahramani. A neural network for learning how to parse tree adjoining grammar. B.S.Eng Thesis, University of Pennsylvania, 1990.
", "award": [], "sourceid": 1584, "authors": [{"given_name": "Oriol", "family_name": "Vinyals", "institution": "Google"}, {"given_name": "\u0141ukasz", "family_name": "Kaiser", "institution": "Google"}, {"given_name": "Terry", "family_name": "Koo", "institution": "Google"}, {"given_name": "Slav", "family_name": "Petrov", "institution": "Google"}, {"given_name": "Ilya", "family_name": "Sutskever", "institution": "Google"}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": "Google"}]}