{"title": "Paraphrase Generation with Latent Bag of Words", "book": "Advances in Neural Information Processing Systems", "page_first": 13645, "page_last": 13656, "abstract": "Paraphrase generation is a longstanding important problem in natural language processing. \n  Recent progress in deep generative models has shown promising results on discrete latent variables for text generation. \n  Inspired by variational autoencoders with discrete latent structures, \n  in this work, we propose a latent bag of words (BOW) model for paraphrase generation.\n  We ground the semantics of a discrete latent variable by the target BOW. \n  We use this latent variable to build a fully differentiable content planning and surface realization pipeline. \n  Specifically, we use source words to predict their neighbors and model the target BOW with a mixture of softmax. \n  We use gumbel top-k reparameterization to perform differentiable subset sampling from the predicted BOW distribution.\n  We retrieve the sampled word embeddings and use them to augment the decoder and guide its generation search space. \n  Our latent BOW model not only enhances the decoder, but also exhibits clear interpretability.\n  We show the model interpretability with regard to (1). unsupervised learning of word neighbors (2). the step-by-step generation procedure. \n  Extensive experiments demonstrate the model's transparent and effective generation process.", "full_text": "Paraphrase Generation with Latent Bag of Words\n\nYao Fu\n\nYansong Feng\n\nDepartment of Computer Science\n\nInstitute of Computer Science and Technology\n\nColumbia University\n\nyao.fu@columbia.edu\n\nPeking University\n\nfengyansong@pku.edu.cn\n\nJohn P. Cunningham\nDepartment of Statistics\n\nColumbia University\n\njpc2181@columbia.edu\n\nAbstract\n\nParaphrase generation is a longstanding important problem in natural language\nprocessing. 
In addition, recent progress in deep generative models has shown\npromising results on discrete latent variables for text generation.\nInspired by\nvariational autoencoders with discrete latent structures, in this work, we propose\na latent bag of words (BOW) model for paraphrase generation. We ground the\nsemantics of a discrete latent variable by the BOW from the target sentences. We\nuse this latent variable to build a fully differentiable content planning and surface\nrealization model. Speci\ufb01cally, we use source words to predict their neighbors\nand model the target BOW with a mixture of softmax. We use Gumbel top-k\nreparameterization to perform differentiable subset sampling from the predicted\nBOW distribution. We retrieve the sampled word embeddings and use them to\naugment the decoder and guide its generation search space. Our latent BOW model\nnot only enhances the decoder, but also exhibits clear interpretability. We show the\nmodel interpretability with regard to (i) unsupervised learning of word neighbors\n(ii) the step-by-step generation procedure. Extensive experiments demonstrate the\ntransparent and effective generation process of this model.1\n\n1\n\nIntroduction\n\nThe generation of paraphrases is a longstanding problem for learning natural language [33]. Para-\nphrases are de\ufb01ned as sentences conveying the same meaning but with different surface realization.\nFor example, in a question answering website, people may ask duplicated questions like How do I\nimprove my English v.s. What is the best way to learn English. 
Paraphrase generation is important, not only because paraphrases demonstrate the diverse nature of human language, but also because the generation system can be a key component of other important language understanding tasks, such as question answering [5, 11], machine translation [7], and semantic parsing [43].
Traditional models are generally rule-based: they find lexical substitutions from WordNet [34] style resources, then substitute the content words accordingly [3, 36, 21]. Recent neural models primarily rely on the sequence-to-sequence (seq2seq) learning framework [44, 40], achieving inspiring performance gains over the traditional methods. Despite its effectiveness, seq2seq learning offers little interpretability. The sentence embedding produced by the encoder is not directly associated with any linguistic aspects of that sentence2. On the other hand, although interpretable, many traditional methods suffer from suboptimal performance [40]. In this work we introduce a model with optimal performance that maintains and benefits from semantic interpretability.

1Our code can be found at https://github.com/FranxYao/dgm_latent_bow
2The linguistic aspects we refer to include, but are not limited to: words, phrases, syntax, and semantics.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Our model equips the seq2seq model (lower part) with a latent bag of words (upper part).

To improve model interpretability, researchers typically follow two paths. First, from a probabilistic perspective, one might encode the source sentence into a latent code with certain structures [22] (e.g. a Gaussian variable for the MNIST [24] dataset). 
From a traditional natural language generation (NLG) perspective, one might explicitly separate content planning and surface realization [35]. The traditional word substitution models for paraphrase generation are an example of planning and realization: first, word neighbors are retrieved from WordNet (the planning stage), and then words are substituted and re-organized to form a paraphrase (the realization stage). Neighbors of a given word are words that are semantically close to it (e.g. improve → learn). Here the interpretability comes from a linguistic perspective, since the model performs generation step by step: it first proposes the content, then generates according to the proposal. Although effective across many applications, both approaches have their own drawbacks. The probabilistic approach lacks an explicit connection between the code and the semantic meaning, whereas for the traditional NLG approach, the separation of planning and realization is (across most models) nondifferentiable [6, 35], which sacrifices the end-to-end learning capability of network models, a capability that has proven critical in a vast number of deep learning settings.
In an effort to bridge these two approaches, we propose a hierarchical latent bag of words model for planning and realization. Our model uses words of the source sentence to predict their neighbors in the bag of words from the target sentences3. From the predicted word neighbors, we sample a subset of words as our content plan, and organize these words into a full sentence. We use Gumbel top-k reparameterization [20, 32] for differentiable subset sampling [49], making the planning and realization fully end-to-end. 
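The Gumbel top-k sampling step mentioned above can be sketched in a few lines. This is a minimal numpy illustration of sampling k items without replacement by perturbing log-probabilities, assuming a toy 5-word vocabulary; the function name `gumbel_topk` is ours, not the paper's TensorFlow implementation:

```python
import numpy as np

def gumbel_topk(log_probs, k, rng):
    """Sample k distinct indices from a categorical distribution by
    perturbing log-probabilities with Gumbel(0, 1) noise and keeping
    the top-k perturbed weights (sampling without replacement)."""
    g = rng.gumbel(loc=0.0, scale=1.0, size=log_probs.shape)
    perturbed = log_probs + g
    # indices of the k largest perturbed weights
    return np.argsort(-perturbed)[:k]

rng = np.random.default_rng(0)
probs = np.array([0.4, 0.3, 0.2, 0.05, 0.05])  # toy BOW distribution
idx = gumbel_topk(np.log(probs), k=3, rng=rng)
# idx holds 3 distinct word indices: a 3-word "plan" drawn without replacement
```

Because the perturbation is additive in log-space, the top-k of the perturbed weights is distributed exactly as k draws without replacement from the base categorical.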
Our model then exhibits interpretability from both perspectives: from the probabilistic perspective, since we optimize a discrete latent variable towards the bag of words of the target sentence, the meaning of this variable is grounded with explicit lexical semantics; from the traditional NLG perspective, our model follows the planning and realization steps, yet is fully differentiable. Our contributions are:

• We endow a hierarchical discrete latent variable with explicit lexical semantics. Specifically, we use the bag of words from the target sentences to ground the latent variable.

• We use this latent variable model to build a differentiable step-by-step content planning and surface realization pipeline for paraphrase generation.

• We demonstrate the effectiveness of our model with extensive experiments and show its interpretability with respect to clear generation steps and the unsupervised learning of word neighbors.

2 Model

Our goal is to extend the seq2seq model (Figure 1, lower part) with differentiable content planning and surface realization (Figure 1, upper part). We begin with a discussion of the seq2seq base model.

3In practice, we gather the words from target sentences into a set. This set is our target BOW.

2.1 The Sequence to Sequence Base Model.

The classical seq2seq model encodes the source sequence x = x1, x2, ..., xm into a code h, and decodes it to the target sequence y (Figure 1, lower part), where m and n are the lengths of the source and the target, respectively [44, 2]. 
The encoder enc_ψ and the decoder dec_θ are both deep networks. In our work, they are implemented as LSTMs [18]. The loss function is the negative log likelihood:

  h = enc_ψ(x)
  p(y|x) = dec_θ(h)
  L_S2S = E_{(x*, y*) ∼ P*}[− log p_θ(y* | x*)],    (1)

where P* is the true data distribution. The model is trained with a gradient-based optimizer, and we use Adam [23] in this work. In this setting, the code h does not have direct interpretability. To add interpretability, in our model, we ground the meaning of a latent structure with lexical semantics.

2.2 Bag of Words for Content Planning.

Now we consider formulating a plan as a bag of words before the surface realization process. Formally, let V be the vocabulary of size V; then a bag of words (BOW) z of size k is a random set formulated as a k-hot vector in R^V. We assume z is sampled from a base categorical distribution p(z̃|x) k times without replacement. Directly modeling the distribution of z is hard due to the combinatorial complexity, so instead we model its base categorical distribution p(z̃|x). In paraphrase generation datasets, one source sentence may correspond to multiple target sentences. Our key modeling assumption is that the BOW from the target sentences (target BOW) should be similar to the neighbors of the words in the source sentence. As such, we define the base categorical variable z̃ as the mixture of all neighbors of all source words. Namely, first, for each source word x_i, we model its neighbor word with a one-hot z_ij ∈ R^V:

  p(z_ij | x_i) = Categorical(φ_ij(x_i))    (2)

The support of z_ij is also the word vocabulary V, and φ_ij is parameterized by a neural network. In practice, we use a softmax layer on top of the hidden state of each timestep from the encoder LSTM. We assume a fixed total number of neighbors l for each x_i (j = 1, ..., l). 
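As a minimal sketch of this neighbor-prediction layer: each source position's hidden state is projected to l softmax distributions over the vocabulary. All dimensions and the names `h`, `W` here are toy illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# toy sizes: m source words, l neighbor slots per word, vocabulary V, hidden dim d
m, l, V, d = 4, 3, 10, 8
rng = np.random.default_rng(0)
h = rng.normal(size=(m, d))     # encoder hidden states, one per source word
W = rng.normal(size=(l, d, V))  # one projection per neighbor slot (the phi_ij)

# p(z_ij | x_i): for each source word i and slot j, a categorical over V
neighbor_probs = softmax(np.einsum('md,ldv->mlv', h, W))
# neighbor_probs has shape (m, l, V); each (i, j) row sums to 1
```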
We then mix the probabilities of these neighbors:

  z̃ ∼ p_φ(z̃|x) = (1/ml) Σ_{i,j} p(z_ij | x_i)    (3)

where ml is the maximum number of predicted words. z̃ is a categorical variable mixing all neighbor words. We construct the bag of words z by sampling from p_φ(z̃|x) k times without replacement. We then use z as the plan for decoding y. The generative process can be written as:

  z ∼ p_φ(z̃|x)    (sample k times without replacement)
  y ∼ p_θ(y|x, z) = dec_θ(x, z)    (4)

For optimization, we minimize the negative log likelihood of p(y|x, z) and p_φ(z̃|x):

  L_S2S' = E_{(x*, y*) ∼ P*, z ∼ p_φ(z̃|x)}[− log p_θ(y* | x*, z)]
  L_BOW = E_{z* ∼ P*}[− log p_φ(z* | x)]
  L_tot = L_S2S' + L_BOW    (5)

where P* is the true distribution of the BOW from the target sentences and z* is a k-hot vector representing the target bag of words. One could also view L_BOW as a regularization of p_φ using the weak supervision from the target bag of words. Another choice is to view z as completely latent and infer it like a canonical latent variable model.4 We find that using the target BOW regularization significantly improves the performance and interpretability. L_tot is the total loss to optimize over the parameters ψ, θ, φ. Note that for a given source word in a particular training instance, the NLL loss does not penalize predictions that are not included in the targets of this instance. This property enables the model to learn different neighbors from different data points, i.e., the learned neighbors will be at the corpus level, rather than the sentence level. We will further demonstrate this property in our experiments.

4In such a case, one could consider variational inference (VI) with q(z|x) and regularize the variational posterior. 
Similar to our relaxation of p(z|x) using p(z̃|x), there should also be certain relaxations over the variational family to make the inference tractable. We leave this to future work.

2.3 Differentiable Subset Sampling with Gumbel Top-k Reparameterization.

As discussed in the previous section, the sampling of z (sampling k items from a categorical distribution) is non-differentiable.5 To back-propagate gradients through φ in L_S2S' in equation 5, we choose a reparameterized gradient estimator that relies on the Gumbel-softmax trick. Specifically, we perform differentiable subset sampling with the Gumbel top-k reparameterization [25]. Let the probability of z̃ be p(z̃ = i|x) = π_i, i ∈ {1, 2, ..., V}; we obtain the perturbed weights by:

  a_i = log π_i + g_i,  g_i ∼ Gumbel(0, 1)    (6)

Retrieving the k largest weights topk(a_1, ..., a_V) gives us k samples without replacement. This process is shown in dashed lines in Figure 1. We retrieve the k sampled word embeddings w_1, ..., w_k and re-weight them with their probabilities π_i. We then use the average of the weighted word embeddings as the decoder LSTM's initial state to perform surface realization.
Intuitively, in addition to the sentence code h, the decoder also takes the weighted sampled word embeddings and attends to them [2]; thus differentiability is achieved. This generated plan restricts the decoding space towards the bag of words of the target sentences. More detailed information about the network architecture and the parameters is in the appendix. In section 4, we use extensive experiments to demonstrate the effectiveness of our model.

3 Related Work

Paraphrase Generation. 
Paraphrases capture the essence of language diversity [39] and often play important roles in many challenging natural language understanding tasks like question answering [5, 11], semantic parsing [43] and machine translation [7]. Traditional methods generally employ rule-based content planning and surface realization procedures [3, 36, 21]. These methods often rely on WordNet [34] style word neighbors for selecting substitutions. Our model can learn the word neighbors in an unsupervised manner and predict them on the fly. Recent end-to-end models for paraphrase generation include the attentive seq2seq model [43], the Residual LSTM model [40], the Gaussian VAE model [15], the copy and constrained decoding model [6], and the reinforcement learning approach [26]. Our model has connections to the copy and constrained decoding model by Cao et al. [6]. They use an IBM alignment model [9] to restrict the decoder's search space, which is not differentiable. We use the latent BOW model to guide the decoder and use the Gumbel top-k to make the whole generation differentiable. Compared with previous models, our model learns word neighbors in an unsupervised way and exhibits a differentiable planning and realization process.
Latent Variable Models for Text. Deep latent variable models have been an important recent trend [22, 12] in text modeling. One common path is to start from a standard VAE with a Gaussian prior [4], which may encounter issues due to posterior collapse [10, 16]. Multiple approaches have been proposed to control the tradeoff between the inference network and the generative network [52, 50]. In particular, the β-VAE [17] uses a balance parameter β to balance the two models in an intuitive way. This approach will form one of our baselines.
Many discrete aspects of the text may not be captured by a continuous latent variable. 
To better fit the discrete nature of sentences, with the help of the Gumbel-softmax trick [32, 20], recent works add discrete structures to the latent variable [53, 48, 8]. Our work directly maps the meaning of a discrete latent variable to the bag of words from the target sentences. To achieve this, we utilize the recent differentiable subset sampling [49] with the Gumbel top-k [25] reparameterization. It has also been noticed that the multimodal nature of text can pose challenges for the modeling process [53]. Previous works show that mixture models can help [1, 51]. In our work, we show the effectiveness of the mixture of softmax for the multimodal bag of words distribution.
Content Planning and Surface Realization. The generation process of natural language can be decomposed into two steps: content planning and surface realization (also called sentence generation) [35]. The seq2seq model [44] implicitly performs the two steps by encoding the source sentence into an embedding and generating the target sentence with the decoder LSTM. A downside is that this intermediate embedding makes it hard to explicitly control or interpret the generation process [2, 35].

5One choice could be the score function estimator, but empirically it suffers from high variance.

Previous works have shown that explicit planning before generation can improve the overall performance. Puduppully et al. [41], Sha et al. [42], Gehrmann et al. [13], and Liu et al. [30] embed the planning process into the network architecture. Moryossef et al. [35] use a rule-based model for planning and a neural model for realization. Wiseman et al. [48] use a latent variable to model the sentence template. Wang et al. [46] use a latent topic to model the topic BOW, while Ma et al. [31] use BOW as regularization. Conceptually, our model is similar to Moryossef et al. [35] as we both perform generation step by step. Our model is also related to Ma et al. [31]. 
While they use BOW for regularization, we map the meaning of the latent variable to the target BOW, and use the latent variable to guide the generation.

4 Experiments

Datasets and Metrics. Following the settings in previous works [26, 15], we use the Quora6 dataset and the MSCOCO [28] dataset for our experiments. The MSCOCO dataset was originally developed for image captioning. Each image is associated with 5 different captions. These captions are generally close to each other since they all describe the same image. Although there is no guarantee that the captions must be paraphrases, as they may describe different objects in the same image, the overall quality of this dataset is favorable. In our experiments, we use 1 of the 5 captions as the source and all content words7 from the remaining 4 sentences as our BOW objective. We randomly choose one of the remaining 4 captions as the seq2seq target. The Quora dataset was originally developed for duplicate question detection. Duplicated questions are labeled by human annotators and guaranteed to be paraphrases. In this dataset we only have two sentences for each paraphrase set, so we randomly choose one as the source and the other as the target. After processing, for the Quora dataset, there are 50K training instances and 20K testing instances, and the vocabulary size is 8K. For the MSCOCO dataset, there are 94K training instances and 23K testing instances, and the vocabulary size is 11K. We set the maximum sentence length for both datasets to 16. More details about datasets and pre-processing are given in the appendix.
Although the evaluation of text generation can be challenging [37, 29, 47], previous works show that matching-based metrics like BLEU [38] or ROUGE [27] are suitable for this task as they correlate well with human judgment [26]. 
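As a reminder of what such matching-based metrics compute, here is a minimal sketch of clipped n-gram precision, the core quantity behind BLEU-n (simplified: real BLEU also applies a brevity penalty and combines several n-gram orders):

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference:
    each candidate n-gram is credited at most as often as it occurs
    in the reference."""
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    if not cand:
        return 0.0
    clipped = sum(min(c, ref[g]) for g, c in Counter(cand).items())
    return clipped / len(cand)

cand = "how can i improve my english".split()
ref = "how do i improve my english speaking".split()
p1 = ngram_precision(cand, ref, 1)  # 5 of 6 unigrams match -> 5/6
p2 = ngram_precision(cand, ref, 2)  # 3 of 5 bigrams match -> 3/5
```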
We report all lower n-gram metrics (1-4 grams in BLEU, 1-2 grams in ROUGE) because these have been shown preferable for short sentences [26, 29].
Baseline Models. We use the seq2seq LSTM with residual connections [40] and the attention mechanism [2] as our baseline (Residual Seq2seq-Attn). We also use the β-VAE as a baseline generative model and control the β parameter to balance the reconstruction and the recognition networks. Since the VAE models do not utilize the attention mechanism, we also include a vanilla sequence to sequence baseline without attention (Seq2seq). It should be noted that although we do not include other SOTA models like the Transformer [45], the Seq2seq-Attn model is trained with a state size of 500 and 2 stacked LSTM layers, making it strong enough to be hard to beat. We also use a hard version of our BOW model (BOW-Hard) as a lower bound, which optimizes the encoder and the decoder separately and passes no gradient back from the decoder to the encoder. We compare two versions of our latent BOW model: the top-k version (LBOW-Topk), which directly chooses the k most probable words from the encoder, and the Gumbel version (LBOW-Gumbel), which samples from the BOW distribution with the Gumbel reparameterization, thus injecting randomness into the model. Additionally, we also consider a cheating model that is able to see the BOW of the actual target sentences during generation (Cheating BOW). This model can be considered an upper bound of our models. The evaluation of the LBOW models is performed on the held-out test set, so they cannot see the target BOW. All of the above models are approximately the same size, so the comparison is fair. In addition, we compare our results with Li et al. [26]. Their model is SOTA on the Quora dataset. 
The numbers of their model are not directly comparable to ours since they use twice as much data, containing negative samples for inverse reinforcement learning.8 Experiments are repeated three times with different random seeds. The average performance is reported. More configuration details are listed in the appendix.

6https://www.kaggle.com/aymenmouelhi/quora-duplicate-questions
7We view nouns, verbs, adverbs, and adjectives as content words. We view pronouns, prepositions, conjunctions and punctuation as non-content words.
8They do not release their code, so their detailed data processing should also differ from ours, making the results not directly comparable.

Table 1: Results on the Quora and MSCOCO datasets. B for BLEU and R for ROUGE.

Quora
Model                        B-1    B-2    B-3    B-4    R-1    R-2    R-L
Seq2seq [40]                 54.62  40.41  31.25  24.97  57.27  33.04  54.62
Residual Seq2seq-Attn [40]   54.59  40.49  31.25  24.89  57.10  32.86  54.61
β-VAE, β=10^-3 [17]          43.02  28.60  20.98  16.29  41.81  21.17  40.09
β-VAE, β=10^-4 [17]          47.86  33.21  24.96  19.73  47.62  25.49  45.46
BOW-Hard (lower bound)       33.40  21.18  14.43  10.36  36.08  16.23  33.77
LBOW-Topk (ours)             55.79  42.03  32.71  26.17  58.79  34.57  56.43
LBOW-Gumbel (ours)           55.75  41.96  32.66  26.14  58.60  34.47  56.23
RbM-SL [26]                  -      43.54  -      -      64.39  38.11  -
RbM-IRL [26]                 -      43.09  -      -      64.02  37.72  -
Cheating BOW (upper bound)   72.96  61.78  54.40  49.47  72.15  52.61  68.53

MSCOCO
Model                        B-1    B-2    B-3    B-4    R-1    R-2    R-L
Seq2seq [40]                 69.61  47.14  31.64  21.65  40.11  14.31  36.28
Residual Seq2seq-Attn [40]   71.24  49.65  34.04  23.66  41.07  15.26  37.35
β-VAE, β=10^-3 [17]          68.81  45.82  30.56  20.99  39.63  13.86  35.81
β-VAE, β=10^-4 [17]          70.04  47.59  32.29  22.54  40.72  14.75  36.75
BOW-Hard (lower bound)       48.14  28.35  16.25  9.28   31.66  8.30   27.37
LBOW-Topk (ours)             72.60  51.14  35.66  25.27  42.08  16.13  38.16
LBOW-Gumbel (ours)           72.37  50.81  35.32  24.98  42.12  16.05  38.13
Cheating BOW (upper bound)   80.87  75.09  62.24  52.64  49.95  23.94  43.77

4.1 Experiment Results

Table 1 shows the overall performance of all models. Our models perform the best compared with the baselines. The Gumbel version performs slightly worse than the top-k version, but they are generally on par. The margins over the Seq2seq-Attn are not that large (approximately 1+ BLEU). This is because the capacity of all models is large enough to fit the datasets fairly well. The BOW-Hard model does not perform as well, indicating that the differentiable subset sampling is important for training our discrete latent model. Although not directly comparable, the numbers of the RbM models are higher than ours since they are SOTA models on Quora. But they are still not as high as the Cheating BOW's, which is consistent with our analysis. The Cheating BOW outperforms all other models by a large margin with the leaked BOW information in the target sentences. 
This shows that the Cheating BOW is indeed a meaningful upper bound and that the accuracy of the predicted BOW is essential for an effective decoding process. Additionally, we notice that the β-VAEs are not as good as the vanilla Seq2seq models. Our conjecture is that it is difficult to find a good balance between the latent code and the generative model. In comparison, our model directly grounds the meaning of the latent variable to be the bag of words from the target sentences. In the next section, we show this approach further induces the unsupervised learning of word neighbors and the interpretable generation stages.

5 Result Analysis

5.1 Model Interpretability

Figure 2 shows the planning and realization stages of our model. Given a source sentence, it first generates the word neighbors, samples from the generated BOW (planning), and generates the sentence (realization). In addition to the statistical interpretability, our model shows clear linguistic interpretability. Compared to the vanilla seq2seq model, the interpretability comes from: (1) unsupervised learning of word neighbors, and (2) the step-by-step generation process.

Figure 2: Sentence generation samples. Our model exhibits clear interpretability with three generation steps: (1) generate the neighbors of the source words, (2) sample from the neighbor BOW, (3) generate from the BOW sample. Different types of learned lexical semantics are highlighted.

Unsupervised Learning of Word Neighbors. As highlighted in Figure 2, we notice that the model discovers multiple types of lexical semantics among word neighbors, including: (1) word morphology, e.g., speak - speaking - spoken; (2) synonymy, e.g., big - large, racket - racquet; (3) entailment, e.g., improve - english; (4) metonymy9, e.g., search - googling. 
The identical mapping is also learned (e.g., blue - blue) since all words are neighbors to themselves.
The model can learn this because, although there is no explicit alignment, words from the target sentences are semantically close to the source words. The mixture model drops the order information of the source words and effectively matches the predicted word set to the BOW from the target sentences. The most prominent word-neighbor relations are back-propagated to words in the source sentences during optimization. Consequently, the model discovers the word similarity structures.
The Generation Steps. A generation plan is formulated by the sampling procedure from the BOW prediction. Consequently, an accurate prediction of the BOW is essential for guiding the decoder, as demonstrated by the Cheating BOW model in the previous section. The decoder then performs surface realization based on the plan. During realization, the word choices come from (1) the plan and (2) the decoder's language model. As we can see from the second example in Figure 2, the planned words include english, speak, improve, and the decoder generates other necessary words like how, i, my from its language model to connect the planned words, forming the output: how can i improve my english speaking? In the next section, we quantitatively analyze the performance of the BOW prediction and investigate how it is utilized by the decoder.

5.2 The Implications of the Latent BOW Prediction

Distributional Coverage. We first verify our model's effectiveness for the multimodal BOW distribution. Figure 3 (left) shows the number of learned modes during the training process, compared with the number of target modes (the number of words in the target sentences). For a single categorical variable in our model, if the largest probability in the softmax is greater than 0.5, we define it as a discovered mode. The figure shows an increasing trend of mode discovery. 
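The mode-counting criterion above can be sketched as follows (our own toy illustration, not the paper's evaluation code):

```python
import numpy as np

def count_discovered_modes(softmax_probs, threshold=0.5):
    """Count categorical variables whose largest softmax probability
    exceeds the threshold; each such variable is a 'discovered mode'."""
    return int((softmax_probs.max(axis=-1) > threshold).sum())

# toy predictions: 3 categorical variables over a 4-word vocabulary
probs = np.array([
    [0.70, 0.10, 0.10, 0.10],  # confident -> a discovered mode
    [0.30, 0.30, 0.20, 0.20],  # diffuse   -> not a mode
    [0.05, 0.90, 0.03, 0.02],  # confident -> a discovered mode
])
n_modes = count_discovered_modes(probs)  # 2
```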
In the MSCOCO dataset, after convergence, the number of discovered modes is less than the number of target modes, while in the Quora dataset, the model learns more modes than the target. This difference comes from two different aspects of these datasets. First, the MSCOCO dataset has more target sentences (4 sentences) than the Quora dataset (1 sentence), which is intuitively harder to cover. Second, the MSCOCO dataset is noisier because the sentences are not guaranteed to be paraphrases. The words in the target sentences might not be as strongly correlated with the source. For the Quora dataset, since the NLL loss does not penalize modes not in the label, the model can discover the neighbors of a word from different contexts in multiple training instances. In Figure 3 (right lower), word neighbors like pokemon-manaply, much-spending are not in the target sentence; they are generalized from other instances in the training set. In fact, this property of the NLL loss allows the model to learn corpus-level word similarity (instead of sentence-level), and results in more predicted word neighbors than the BOW from one particular target sentence.

9Informally, if A is the metonymy of B, then A is a stand-in for B, e.g., the White House - the US government; Google - search engines; Facebook - social media.

Figure 3: The effectiveness of the learned BOW. Left: the learned modes vs. the average modes in the bag of words from the target sentences. Our model effectively estimates the multimodal BOW distribution. Right upper: BOW prediction performance and utilization. The model heavily uses the predicted BOW, indicating that the BOW prediction accuracy is essential for good generation quality. Right lower: an example of corpus-level word neighbors. The model learns neighbor words from different instances across the whole training set.

Figure 4: Left: adding more modeling techniques consistently improves the model performance. Right: interpolating the latent space. Control of the output sentence can be achieved by explicitly choosing or modifying the sampled bag of words (in italic).

BOW Prediction Performance and Utilization. As shown in Figure 3 (right), the precision and recall of the BOW prediction are not very high (39+ recall for MSCOCO, 46+ precision for Quora). The supports of the precision/recall correspond to the numbers of predicted/target modes, respectively, in the left figure. We notice that the decoder heavily utilizes the predicted words, since more than 50% of the decoder's word choices come from the BOW. If the encoder can be accurate about the prediction, the decoder's search space would be more effectively restricted to the target space. 
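Set-level BOW precision and recall of this kind can be computed as in the following sketch (a hypothetical helper, not the paper's evaluation script; the example word sets are ours):

```python
def bow_precision_recall(predicted, target):
    """Set-level precision/recall of a predicted bag of words
    against the target bag of words."""
    predicted, target = set(predicted), set(target)
    overlap = predicted & target
    precision = len(overlap) / len(predicted) if predicted else 0.0
    recall = len(overlap) / len(target) if target else 0.0
    return precision, recall

pred = ["man", "tennis", "racket", "walking", "court"]
gold = ["man", "holding", "tennis", "racquet", "court"]
p, r = bow_precision_recall(pred, gold)  # overlap {man, tennis, court}: 3/5 each
```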
This is why leaking the BOW information from the target sentences results in the best BLEU and ROUGE scores in Table 1. However, although imperfect, the additional information from the encoder still provides meaningful guidance and improves the decoder's overall performance. Furthermore, our model is orthogonal to other techniques, such as conditioning each decoder input on the average BOW embedding (BOW emb) or the copy mechanism [14] (copy). When we integrate our model with such techniques, which better exploit the BOW information, we see consistent performance improvements (Figure 4, left).

5.3 Controlled Generation through the Latent Space

One advantage of latent variable models is that they allow us to control the final output from the latent code. Figure 4 shows this property of our model. While the interpolation in previous Gaussian

[Figure 3, right upper: performance and utilization of the BOW]
Dataset  Precision  Recall  #words from BOW  #words from LM  %BOW words
MSCOCO   59.41      39.54   6.75             11.66           57.89
Quora    46.99      80.32   6.88             13.84           49.71

[Figure 3, right lower: an example of corpus-level word neighbors; the learned neighbors come from other training instances, not from this particular instance]
Input: why do people love pokemon go so much
Neighbors: people, like, manaply, going, spending, love, pokémon, pokemon
Reference: what makes pokémon go so popular

[Figure 4, right: controlled generation by modifying the sampled bag of words]
Input: A man on a motorcycle with a bird on the handle
BOW sample 1: man, motorcycle, sitting / Output 1: A man is sitting on a motorcycle
BOW sample 2: man, motorcycle, riding, road / Output 2: A man riding a motorcycle on a dirt road
Input: A man wearing a red tie holding it to show people
BOW sample 1: man, suit, tie / Output 1: A man wearing a suit and tie
BOW sample 2: man, suit, tie, holding, picture / Output 2: A man wearing a suit and tie is holding a picture

[Figure 4, left: Quora scores]
Model           B1     B2     R1     R2
seq2seq         54.6   40.41  57.27  33.04
LBOW            55.8   42.03  58.79  34.57
LBOW + BOW emb  56.16  42.14  58.66  34.36
LBOW + Copy     56.53  42.67  59.85  35.30
Exploit the BOW information with different components.
Adding more sophisticated techniques to the BOW yields consistent improvements.
VAEs [24, 4] can only be interpreted as the arithmetic of latent vectors from a geometric perspective, our discrete version of interpolation can be interpreted from a lexical-semantics perspective: adding, deleting, or changing certain words. In the first example, the word sitting is changed to riding, and an additional word, road, is added. This changes the final sentence from man ... sitting ... to man riding ... on ... road. The second example is another addition case, where holding and picture are added to the sentence. Although not quite stable in our experiments,10 this property suggests further application potential for lexically controllable text generation.

6 Conclusion

The latent BOW model serves as a bridge between latent variable models and planning-and-realization models. Its interpretability comes from the clear generation stages, while its performance improvement comes from the guidance of the sampled bag-of-words plan. Although the model is effective, we find that the decoder relies heavily on the BOW prediction, yet the prediction is not highly accurate. On the other hand, when BOW information leaks from the target sentences, the decoder achieves significantly higher performance. This indicates that a future direction is to improve the BOW prediction so as to better restrict the decoder's search space. Overall, the step-by-step generation process is a move toward more interpretable generative models, and it opens new possibilities for controllable realization through directly injecting lexical information into the intermediate stages of surface realization.

Acknowledgments

We thank the reviewers for their detailed feedback and suggestions. We thank Luhuan Wu and Yang Liu for meaningful discussions.
This research is supported by the China Scholarship Council, the Sloan Fellowship, the McKnight Fellowship, NIH, and NSF.

References

[1] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 224–232. JMLR.org, 2017.

[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2015.

[3] Igor A Bolshakov and Alexander Gelbukh. Synonymous paraphrasing using wordnet and internet. In International Conference on Application of Natural Language to Information Systems, pages 312–323. Springer, 2004.

[4] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating sentences from a continuous space. In CoNLL, 2016.

[5] Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Andrea Gesmundo, Neil Houlsby, Wojciech Gajewski, and Wei Wang. Ask the right questions: Active question reformulation with reinforcement learning. CoRR, abs/1705.07830, 2018.

[6] Ziqiang Cao, Chuwei Luo, Wenjie Li, and Sujian Li. Joint copying and restricted generation for paraphrase. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[7] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

[8] Jihun Choi, Kang Min Yoo, and Sang-goo Lee. Learning to compose task-specific tree structures. In AAAI, 2018.

[9] Michael Collins.
Statistical machine translation: IBM models 1 and 2.

10The model does not guarantee that the input words will appear in the output; to add such a constraint, one could consider constrained decoding such as Grid Beam Search [19].

[10] Adji B. Dieng, Yoon Kim, Alexander M. Rush, and David M. Blei. Avoiding latent variable collapse with generative skip models. CoRR, abs/1807.04863, 2018.

[11] Li Dong, Jonathan Mallinson, Siva Reddy, and Mirella Lapata. Learning to paraphrase for question answering. In EMNLP, 2017.

[12] Yao Fu. Deep generative models for natural language processing. 2018. URL https://github.com/FranxYao/Deep-Generative-Models-for-Natural-Language-Processing.

[13] Sebastian Gehrmann, Falcon Z. Dai, Henry Elder, and Alexander M. Rush. End-to-end content and plan selection for data-to-text generation. In INLG, 2018.

[14] Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1154. URL https://www.aclweb.org/anthology/P16-1154.

[15] Ankush Gupta, Arvind Agarwal, Prawaan Singh, and Piyush Rai. A deep generative framework for paraphrase generation. In AAAI, 2018.

[16] Junxian He, Daniel Spokoyny, Graham Neubig, and Taylor Berg-Kirkpatrick. Lagging inference networks and posterior collapse in variational autoencoders. arXiv preprint arXiv:1901.05534, 2019.

[17] Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.

[18] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.
Neural computation, 9(8):1735–1780, 1997.

[19] Chris Hokamp and Qun Liu. Lexically constrained decoding for sequence generation using grid beam search. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1535–1546, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1141. URL https://www.aclweb.org/anthology/P17-1141.

[20] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. CoRR, abs/1611.01144, 2017.

[21] David Kauchak and Regina Barzilay. Paraphrasing for automatic evaluation. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 455–462. Association for Computational Linguistics, 2006.

[22] Yoon Kim, Sam Wiseman, and Alexander M. Rush. A tutorial on deep latent variable models of natural language. CoRR, abs/1812.06834, 2018.

[23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[24] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[25] Wouter Kool, Herke van Hoof, and Max Welling. Stochastic beams and where to find them: The gumbel-top-k trick for sampling sequences without replacement. CoRR, abs/1903.06059, 2019. URL http://arxiv.org/abs/1903.06059.

[26] Zichao Li, Xin Jiang, Lifeng Shang, and Hang Li. Paraphrase generation with deep reinforcement learning. In EMNLP, 2018.

[27] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out, 2004.

[28] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick.
Microsoft COCO: Common objects in context. In ECCV, 2014.

[29] Chia-Wei Liu, Ryan Lowe, Iulian V Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023, 2016.

[30] Tianyu Liu, Kexiang Wang, Lei Sha, Baobao Chang, and Zhifang Sui. Table-to-text generation by structure-aware seq2seq learning. In AAAI, 2018.

[31] Shuming Ma, Xu Sun, Yizhong Wang, and Junyang Lin. Bag-of-words as target for neural machine translation. In ACL, 2018.

[32] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. CoRR, abs/1611.00712, 2017.

[33] Kathleen R McKeown. Paraphrasing questions using given and new information. Computational Linguistics, 9(1):1–10, 1983.

[34] George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995.

[35] Amit Moryossef, Yoav Goldberg, and Ido Dagan. Step-by-step: Separating planning from realization in neural data-to-text generation. CoRR, abs/1904.03396, 2019.

[36] Shashi Narayan, Siva Reddy, and Shay B Cohen. Paraphrase generation from latent-variable pcfgs for semantic parsing. arXiv preprint arXiv:1601.06068, 2016.

[37] Jekaterina Novikova, Ondrej Dusek, Amanda Cercas Curry, and Verena Rieser. Why we need new evaluation metrics for nlg. In EMNLP, 2017.

[38] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics, 2002.

[39] Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevich, Benjamin Van Durme, and Chris Callison-Burch.
PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Association for Computational Linguistics, Beijing, China, July 2015. Association for Computational Linguistics.

[40] Aaditya Prakash, Sadid A. Hasan, Kathy Lee, Vivek Datla, Ashequl Qadir, Joey Liu, and Oladimeji Farri. Neural paraphrase generation with stacked residual lstm networks. In COLING, 2016.

[41] Ratish Puduppully, Li Dong, and Mirella Lapata. Data-to-text generation with content selection and planning. CoRR, abs/1809.00582, 2019.

[42] Lei Sha, Lili Mou, Tianyu Liu, Pascal Poupart, Sujian Li, Baobao Chang, and Zhifang Sui. Order-planning neural text generation from structured data. In AAAI, 2018.

[43] Yu Su and Xifeng Yan. Cross-domain semantic parsing via paraphrasing. arXiv preprint arXiv:1704.05974, 2017.

[44] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.

[45] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[46] Wenlin Wang, Zhe Gan, Hongteng Xu, Ruiyi Zhang, Guoyin Wang, Dinghan Shen, Changyou Chen, and Lawrence Carin. Topic-guided variational autoencoders for text generation. CoRR, abs/1903.07137, 2019.

[47] Xin Wang, Wenhu Chen, Yuan-Fang Wang, and William Yang Wang. No metrics are perfect: Adversarial reward learning for visual storytelling. arXiv preprint arXiv:1804.09160, 2018.

[48] Sam Wiseman, Stuart M. Shieber, and Alexander M. Rush. Learning neural templates for text generation. In EMNLP, 2018.

[49] Sang Michael Xie and Stefano Ermon. Differentiable subset sampling. CoRR, abs/1901.10517, 2019. URL http://arxiv.org/abs/1901.10517.

[50] Jiacheng Xu and Greg Durrett.
Spherical latent spaces for stable variational autoencoders. In EMNLP, 2018.

[51] Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. Breaking the softmax bottleneck: A high-rank rnn language model. arXiv preprint arXiv:1711.03953, 2017.

[52] Junbo Jake Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, and Yann LeCun. Adversarially regularized autoencoders. In ICML, 2018.

[53] Zachary M. Ziegler and Alexander M. Rush. Latent normalizing flows for discrete sequences. CoRR, abs/1901.10548, 2019.