{"title": "Content preserving text generation with attribute controls", "book": "Advances in Neural Information Processing Systems", "page_first": 5103, "page_last": 5113, "abstract": "In this work, we address the problem of modifying textual attributes of sentences. Given an input sentence and a set of attribute labels, we attempt to generate sentences that are compatible with the conditioning information. To ensure that the model generates content compatible sentences, we introduce a reconstruction loss which interpolates between auto-encoding and back-translation loss components. We propose an adversarial loss to enforce generated samples to be attribute compatible and realistic. Through quantitative, qualitative and human evaluations we demonstrate that our model is capable of generating fluent sentences that better reflect the conditioning information compared to prior methods. We further demonstrate that the model is capable of simultaneously controlling multiple attributes.", "full_text": "Content preserving text generation\n\nwith attribute controls\n\nLajanugen Logeswaran\u02da Honglak Lee:\n\nSamy Bengio:\n\n\u02daUniversity of Michigan, :Google Brain\n\nllajan@umich.edu,{honglak,bengio}@google.com\n\nAbstract\n\nIn this work, we address the problem of modifying textual attributes of sentences.\nGiven an input sentence and a set of attribute labels, we attempt to generate sen-\ntences that are compatible with the conditioning information. To ensure that the\nmodel generates content compatible sentences, we introduce a reconstruction loss\nwhich interpolates between auto-encoding and back-translation loss components.\nWe propose an adversarial loss to enforce generated samples to be attribute com-\npatible and realistic. Through quantitative, qualitative and human evaluations we\ndemonstrate that our model is capable of generating \ufb02uent sentences that better\nre\ufb02ect the conditioning information compared to prior methods. 
We further demonstrate that the model is capable of simultaneously controlling multiple attributes.\n\n1 Introduction\n\nGenerative modeling of images and text has seen increasing progress over the last few years. Deep generative models such as variational auto-encoders [1], adversarial networks [2] and Pixel Recurrent Neural Nets [3] have driven most of this success in vision. Conditional generative models capable of providing fine-grained control over the attributes of a generated image, such as facial attributes [4] and attributes of birds and flowers [5], have been extensively studied. The style transfer problem, which aims to change more abstract properties of an image, has seen significant advances [6, 7].\n\nThe discrete and sequential nature of language makes it difficult to approach language problems in a similar manner. Changing the value of a pixel by a small amount has negligible perceptual effect on an image. However, distortions to text are not imperceptible in a similar way, and this has largely prevented the transfer of these methods to text.\n\nIn this work we consider a generative model for sentences that is capable of expressing a given sentence in a form that is compatible with a given set of conditioning attributes. Applications of such models include conversational systems [8], paraphrasing [9], machine translation [10], authorship obfuscation [11] and many others. Sequence mapping problems have been addressed successfully with the sequence-to-sequence paradigm [12]. However, this approach requires training pairs of source and target sentences. The lack of parallel data with pairs of similar sentences that differ along certain stylistic dimensions makes this an important and challenging problem.\n\nWe focus on categorical attributes of language. Examples of such attributes include sentiment, language complexity, tense, voice, honorifics, mood, etc.
Our approach draws inspiration from style transfer methods in the vision and language literature. We enforce content preservation using auto-encoding and back-translation losses. Attribute compatibility and realistic sequence generation are encouraged by an adversarial discriminator. The proposed adversarial discriminator is more data efficient and scales to multiple attributes with several classes more easily than prior methods.\n\nEvaluating models that address the transfer task is also quite challenging. Previous works have mostly focused on assessing the attribute compatibility of generated sentences. These evaluations do not penalize vacuous mappings that simply generate a sentence of the desired attribute value while ignoring the content of the input sentence. This calls for new metrics to objectively evaluate models for content preservation. In addition to evaluating attribute compatibility, we consider new metrics for content preservation and generation fluency, and evaluate models using these metrics. We also perform a human evaluation to assess the performance of models along these dimensions.\n\nWe also take a step forward and consider a writing style transfer task for which parallel data is available. Evaluating the model on parallel data assesses it in terms of all properties of interest: generating content and attribute compatible, realistic sentences. Finally, we show that the model is able to learn to control multiple attributes simultaneously. To our knowledge, we demonstrate the first instance of learning to modify multiple textual attributes of a given sentence without parallel data.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\n2 Related Work\n\nConditional Text Generation: Prior work has considered controlling aspects of generated sentences in machine translation, such as length [13], voice [14], and honorifics/politeness [10]. Kiros et al.
[15]\nuse multiplicative interactions between a word embeddings matrix and learnable attribute vectors for\nattribute conditional language modeling. Radford et al. [16] train a character-level language model\non Amazon reviews using LSTMs [17] and discover that the LSTM learns a \u2018sentiment unit\u2019. By\nclamping this unit to a \ufb01xed value, they are able to generate label conditional paragraphs.\n\nHu et al. [18] propose a generative model of sentences which can be conditioned on a sentence and\nattribute labels. The model has a VAE backbone which attempts to express holistic sentence properties\nin its latent variable. A generator reconstructs the sentence conditioned on the latent variable and\nthe conditioning attribute labels. Discriminators are used to ensure attribute compatibility. Training\nsequential VAE models has proven to be very challenging [19, 20] because of the posterior collapse\nproblem. Annealing techniques are generally used to address this issue. However, reconstructions\nfrom these models tend to differ from the input sentence.\n\nStyle Transfer Recent approaches have proposed neural models learned from non-parallel text to\naddress the text style transfer problem. Li et al. [21] propose a simple approach to perform sentiment\ntransfer and generate stylized image captions. Words that capture the stylistic properties of a given\nsentence are identi\ufb01ed and masked out, and the model attempts to reconstruct the sentence using the\nmasked version and its style information. Shen et al. [22] employ adversarial discriminators to match\nthe distribution of decoder hidden state trajectories corresponding to real and synthetic sentences\nspeci\ufb01c to a certain style. Prabhumoye et al. [23] assume that translating a sentence to a different\nlanguage alters the stylistic properties of a sentence. They adopt an adversarial training approach\nsimilar to Shen et al. 
[22] and replace the input sentence using a back-translated sentence obtained\nusing a machine-translation system.\n\nTo encourage generated sentences to match the conditioning stylistic attributes, prior discriminator\nbased approaches train a classi\ufb01er or adversarial discriminator speci\ufb01c to each attribute or attribute\nvalue. In contrast, our proposed adversarial loss involves learning a single discriminator which\ndetermines whether a sentence is both realistic and is compatible with a given set of attribute values.\nWe demonstrate that the model can handle multiple attributes simultaneously, while prior work has\nmostly focused on one or two attributes, which limits their practical applicability.\n\nUnsupervised Machine Translation There is growing interest in discovering latent alignments\nbetween text from multiple languages. Back-translation is an idea that is commonly used in this\ncontext where mapping from a source domain to a target domain and then mapping it back should\nproduce an identical sentence. He et al. [24] attempt to use monolingual corpora for machine\ntranslation. They learn a pair of translation models, one in each direction, and the model is trained\nvia policy gradients using reward signals coming from pre-trained language models and a back-\ntranslation constraint. Artetxe et al. [25] proposed a sequence-to-sequence model with a shared\nencoder, trained using a de-noising auto-encoding objective and an iterative back-translation based\ntraining process. Lample et al. [26] adopt a similar approach but with an unshared encoder-decoder\npair. In addition to de-noising and back-translation losses, adversarial losses are introduced to\nlearn a shared embedding space, similar to the aligned-autoencoder of Shen et al. [22]. 
While the auto-encoding loss and back-translation loss have been used to encourage content preservation in prior work, we identify shortcomings with these individual losses: auto-encoding prefers the copy solution, and back-translated samples can be noisy or incorrect. We propose a reconstruction loss which interpolates between these two losses to reduce the sensitivity of the model to these issues.\n\nInput: I will go to the airport .\nmood=indicative, tense=past -> I went to the airport .\nmood=indicative, tense=present -> I am going to the airport .\nmood=subjunctive, tense=conditional -> I would go to the airport .\n\nFigure 1: Task formulation - given an input sentence and attribute values (e.g., indicative mood, past tense), generate a sentence that preserves the content of the input sentence and is compatible with the attribute values.\n\n3 Formulation\n\nSuppose we have K attributes of interest {a_1, ..., a_K}. We are given a set of labelled sentences D = {(x_n, l_n)}_{n=1}^N, where l_n is a set of labels for a subset of the attributes. Given a sentence x and attribute values l' = (l'_1, ..., l'_K), our goal is to produce a sentence that shares the content of x but reflects the attribute values specified by l' (Figure 1). In this context, we define content as the information in the sentence that is not captured by the attributes. We use the term attribute vector to refer to a binary vector representation of the attribute labels. This is a concatenation of one-hot vector representations of the attribute labels.\n\n3.1 Model Overview\n\nWe denote the generative model by G. We want G to use the conditioning information effectively, i.e., G should generate a sentence that is closely related in meaning to the input sentence and is consistent with the attributes. We design G = (G_enc, G_dec) as an encoder-decoder model.
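The attribute vector described above (a concatenation of one-hot encodings, one block per attribute) can be sketched in a few lines of Python; the attribute names and label sets below are illustrative stand-ins rather than the paper's exact label sets:

```python
# Sketch: a binary attribute vector as a concatenation of one-hot blocks,
# one block per attribute. The attribute/label inventory is illustrative.

ATTRIBUTES = {
    "tense": ["past", "present", "future"],
    "mood": ["indicative", "subjunctive"],
}

def attribute_vector(labels, attributes=ATTRIBUTES):
    """Concatenate one-hot encodings of each attribute's label."""
    vec = []
    for name, values in attributes.items():
        one_hot = [0] * len(values)
        one_hot[values.index(labels[name])] = 1
        vec.extend(one_hot)
    return vec

# "past" gives [1, 0, 0]; "indicative" gives [1, 0]; concatenated:
print(attribute_vector({"tense": "past", "mood": "indicative"}))  # -> [1, 0, 0, 1, 0]
```

Flipping any single label changes only the corresponding one-hot block, which is what lets the decoder condition on each attribute independently.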
The encoder is an RNN that takes the words of the input sentence x as input and produces a content representation z_x = G_enc(x) of the sentence. Given a set of attribute values l', a decoder RNN generates a sequence y ~ p_G(.|z_x, l') conditioned on z_x and l'.\n\n3.2 Content compatibility\n\nWe consider two types of reconstruction losses to encourage content compatibility.\n\nAutoencoding loss: Let x be a sentence and l the corresponding attribute vector, and let z_x = G_enc(x) be the encoded representation of x. Since sentence x should have high probability under p_G(.|z_x, l), we enforce this constraint using an auto-encoding loss:\n\nL_ae(x, l) = -log p_G(x | z_x, l)    (1)\n\nBack-translation loss: Consider l', an arbitrary attribute vector different from l (i.e., one that corresponds to a different set of attribute values), and let y ~ p_G(.|z_x, l') be a generated sentence conditioned on x, l'. Assuming a well-trained model, the sampled sentence y will preserve the content of x. In this case, sentence x should have high probability under p_G(.|z_y, l), where z_y = G_enc(y) is the encoded representation of sentence y. This requirement can be enforced with a back-translation loss:\n\nL_bt(x, l) = -log p_G(x | z_y, l)    (2)\n\nA common pitfall of the auto-encoding loss in auto-regressive models is that the model learns to simply copy the input sequence without capturing any informative features in the latent representation. A de-noising formulation is often considered, where noise is introduced to the input sequence by deleting, swapping or re-arranging words. On the other hand, the generated sample y can be mismatched in content from x during the early stages of training, so that the back-translation loss can potentially misguide the generator.
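The de-noising corruption mentioned above (deleting, swapping or re-arranging words) can be sketched as follows; the drop and swap probabilities are assumed values for illustration, not taken from the paper:

```python
import random

def noise(words, p_drop=0.1, p_swap=0.1, rng=None):
    """Corrupt a word sequence for de-noising auto-encoding:
    randomly drop words, then randomly swap adjacent pairs."""
    rng = rng or random.Random(0)
    # Randomly delete words (keep at least one so the input is non-empty).
    kept = [w for w in words if rng.random() > p_drop] or words[:1]
    # Randomly swap adjacent words.
    out = kept[:]
    for i in range(len(out) - 1):
        if rng.random() < p_swap:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out

sentence = "i will go to the airport".split()
print(noise(sentence))  # a corrupted variant of the input word list
```

Training the auto-encoder to reconstruct `sentence` from `noise(sentence)` forces the latent representation to carry more than a verbatim copy of the input.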
We address these issues by interpolating the latent representations of the ground truth sentence x and the generated sentence y.\n\nInterpolated reconstruction loss: We merge the autoencoding and back-translation losses by fusing the two latent representations z_x, z_y. We consider z_xy = g ⊙ z_x + (1 - g) ⊙ z_y, where g is a binary random vector whose entries are sampled from a Bernoulli distribution with parameter Γ. We define a new reconstruction loss which uses z_xy to reconstruct the original sentence:\n\nL_int = E_{(x,l)~p_data, y~p_G(.|z_x,l')} [-log p_G(x | z_xy, l)]    (3)\n\nFigure 2: Given an input sentence x with attribute labels l, we construct an arbitrary label assignment l' and sample a sentence y ~ p_G(.|z_x, l'). Given the content representations of x and y, an interpolated representation z_xy is computed. The decoder reconstructs the input sentence using z_xy and l. An adversarial discriminator D encourages sequence y to be both realistic and compatible with l'.\n\nNote that L_int degenerates to L_ae when g_i = 1 for all i, and to L_bt when g_i = 0 for all i. The interpolated content embedding makes it harder for the decoder to learn trivial solutions, since it cannot rely on the original sentence alone to perform the reconstruction. Furthermore, it also implicitly encourages the content representations z_x and z_y of x, y to be similar, which is a favorable property of the encoder.\n\n3.3 Attribute compatibility\n\nWe consider an adversarial loss which encourages generating realistic and attribute compatible sentences. The adversarial loss tries to match the distribution of sentence and attribute vector pairs (s, a), where the sentence can be either a real or a generated sentence. Let h_x and h_y be the decoder hidden-state sequences corresponding to x and y, respectively. We consider an adversarial loss of the following form, where D is a discriminator.
Sequence h_x is held constant, and l' ≠ l.\n\nL_adv = min_G max_D E_{(x,l)~p_data, y~p_G(.|z_x,l')} [log D(h_x, l) + log(1 - D(h_y, l'))]    (4)\n\nIt is possible that the discriminator ignores the attributes and makes the real/fake decision based on just the hidden states, or vice versa. To prevent this situation, we consider additional fake pairs (x, l'), similar to [5], where we pair a real sentence with a mismatched attribute vector and encourage the discriminator to classify these pairs as fake. The new objective takes the following form:\n\nL_adv = min_G max_D E_{(x,l)~p_data, y~p_G(.|z_x,l')} [2 log D(h_x, l) + log(1 - D(h_y, l')) + log(1 - D(h_x, l'))]    (5)\n\nOur discriminator architecture follows the projection discriminator [27]:\n\nD(s, l) = σ(l_v^T W φ(s) + v^T φ(s))    (6)\n\nwhere l_v represents the binary attribute vector corresponding to l, φ is a bi-directional RNN encoder (φ(.) represents the final hidden state), W and v are learnable parameters, and σ is the sigmoid function. The overall loss function is given by L_int + λ L_adv, where λ is a hyperparameter.\n\n3.4 Discussion\n\nSoft-sampling and hard-sampling: A challenging aspect of text generation models is dealing with the discrete nature of language, which makes it difficult to generate a sequence and then obtain a learning signal based on it. Soft-sampling is generally used to back-propagate gradients through the sampling process, where an approximation of the sampled word vector at every time-step is used as the input for the next time-step [18, 22]. Inference performs hard-sampling, where sampled words are used instead. Thus, when soft-sampled sequences are used at training time, the training and inference behavior are mismatched. For instance, Shen et al.
[22]'s adversarial loss encourages the hidden-state dynamics of teacher-forced and soft-sampled sequences to be similar. However, there remains a gap between the dynamics of these sequences and sequences hard-sampled at test time. We eliminate this gap by hard-sampling the sequence y. Soft-sampling also has a tendency to introduce artifacts during generation, and these approximations become poorer with large vocabulary sizes. We present an ablative experiment comparing the two sampling strategies in Appendix C.\n\nScalability to multiple attributes: Shen et al. [22] use multiple class-specific discriminators to match the class conditional distributions of sentences. In contrast, our proposed discriminator models the joint distribution of realistic sentences and corresponding attribute labels. Our approach is more data-efficient and exploits the correlation between different attributes as well as attribute values.\n\n4 Experiments\n\nThe sentiment attribute has been widely considered in previous work [18, 22]. We first address the sentiment control task and perform a comprehensive comparison against previous methods. We perform quantitative, qualitative and human evaluations to compare sentences generated by different models. Next, we evaluate the model in a setting where parallel data is available. Finally, we consider the more challenging setting of controlling multiple attributes simultaneously and show that our model easily extends to the multiple attribute scenario.\n\n4.1 Training and hyperparameters\n\nWe use the following validation metrics for model selection. The autoencoding loss L_ae is used to measure how well the model generates content compatible sentences.
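The interpolated content representation z_xy of Section 3.2 reduces to a simple element-wise gate; a minimal sketch in plain Python, with toy vectors standing in for encoder outputs:

```python
import random

def interpolate(z_x, z_y, gamma, rng=None):
    """Element-wise gate z_xy = g * z_x + (1 - g) * z_y,
    with each g_i ~ Bernoulli(gamma)."""
    rng = rng or random.Random(0)
    g = [1 if rng.random() < gamma else 0 for _ in z_x]
    return [gi * a + (1 - gi) * b for gi, a, b in zip(g, z_x, z_y)]

z_x, z_y = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
print(interpolate(z_x, z_y, gamma=1.0))  # -> [1.0, 2.0, 3.0] (pure auto-encoding)
print(interpolate(z_x, z_y, gamma=0.0))  # -> [4.0, 5.0, 6.0] (pure back-translation)
```

The two extreme gate settings recover the degenerate cases noted after Eq. (3): Γ = 1 yields L_ae and Γ = 0 yields L_bt, while intermediate values mix the two representations coordinate-wise.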
Attribute compatibility is measured by generating sentences conditioned on a set of labels, and using pre-trained attribute classifiers to measure how well the samples match the conditioning labels.\n\nFor all tasks we use a GRU (Gated Recurrent Unit [28]) RNN with hidden state size 500 as the encoder G_enc. Attribute labels are represented as a binary vector, and an attribute embedding is constructed via linear projection. The decoder G_dec is initialized using a concatenation of the representation coming from the encoder and the attribute embedding. Attribute embeddings of size 200 and a decoder GRU with hidden state size 700 were used (these parameters are identical to [22]). The discriminator receives an RNN hidden state sequence and an attribute vector as input. The hidden state sequence is encoded using a bi-directional RNN φ with hidden state size 500. The interpolation probability Γ ∈ {0, 0.1, 0.2, ..., 1.0} and the weight of the adversarial loss λ ∈ {0.5, 1.0, 1.5} are chosen based on the validation metrics above. Word embeddings are initialized with pre-trained GloVe embeddings [29].\n\n4.2 Metrics\n\nAlthough the evaluation setups in prior work assess how well the generated sentences match the conditioning labels, they do not assess whether they match the input sentence in content. For most attributes of interest, parallel corpora do not exist. Hence we define objective metrics that evaluate models in a setting where ground truth annotations are unavailable. While individually these metrics have their deficiencies, taken together they are helpful in objectively comparing different models and performing consistent evaluations across different works.\n\nAttribute accuracy: To quantitatively evaluate how well the generated samples match the conditioning labels, we adopt a protocol similar to [18]. We generate samples from the model and measure label accuracy using a pre-trained sentiment classifier.
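The label-accuracy protocol can be sketched as below; the keyword-based classifier is a toy stand-in for the pre-trained CNN classifiers, used only to make the example self-contained:

```python
def label_accuracy(samples, labels, classifier):
    """Fraction of generated samples whose predicted label matches
    the conditioning label."""
    correct = sum(classifier(s) == l for s, l in zip(samples, labels))
    return correct / len(samples)

def toy_classifier(sentence):
    # Toy stand-in: keyword-based sentiment, not the paper's CNN.
    return "positive" if "friendly" in sentence else "negative"

samples = ["the staff was very friendly .", "the food was awful ."]
labels = ["positive", "negative"]  # conditioning labels used at generation time
print(label_accuracy(samples, labels, toy_classifier))  # -> 1.0
```

In the paper's setting, `classifier` would be the review-level CNN sentiment classifier and `samples` the model outputs conditioned on `labels`.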
For the sentiment experiments, the pre-trained classifiers are CNNs trained to perform sentiment analysis at the review level on the Yelp and IMDB datasets [30]. The classifiers achieve test accuracies of 95% and 90% on the respective datasets.\n\nContent compatibility: Measuring content preservation using objective metrics is challenging. Fu et al. [31] propose a content preservation metric which extracts features from word embeddings and measures cosine similarity in the feature space. However, it is hard to design an embedding based metric which disregards the attribute information present in the sentence. We take an indirect approach and measure properties that would hold if the models do indeed produce content compatible sentences. We consider a content preservation metric inspired by the unsupervised model selection criteria of Lample et al. [26] for evaluating machine-translation models without parallel data. Given two non-parallel datasets D_src, D_tgt and translation models M: src -> tgt and M': tgt -> src that map between the two domains, the following metric is defined:\n\nf_content(M, M') = 0.5 [ E_{x~D_src} BLEU(x, M' ∘ M(x)) + E_{x~D_tgt} BLEU(x, M ∘ M'(x)) ]    (7)\n\nwhere M' ∘ M(x) represents translating x ∈ D_src to domain D_tgt and then back to D_src. We assume D_src and D_tgt to be test set sentences of positive and negative sentiment respectively, and M, M' to be the generative model conditioned on positive and negative sentiment, respectively.\n\nModel | Yelp Reviews: Attribute Acc. / B-1 / B-4 / Perp. | IMDB Reviews: Attribute Acc. / B-1 / B-4 / Perp.\nCtrl-gen [18] | 76.36% / 11.5 / 0.0 / 156 | 76.99% / 15.4 / 0.1 / 94\nCross-align [22] | 90.09% / 41.9 / 3.9 / 180 | 88.68% / 31.1 / 1.1 / 63\nOurs | 90.50% / 53.0 / 7.5 / 133 | 94.46% / 40.3 / 2.2 / 52\n\nTable 1: Quantitative evaluation for sentiment conditioned generation.
Attribute compatibility represents label accuracy of generated sentences, as measured by a pre-trained classifier. Content preservation is assessed based on f_content (BLEU-1 (B-1) and BLEU-4 (B-4) scores). Fluency is evaluated in terms of the perplexity of generated sentences as measured by a pre-trained language model. Higher numbers are better for accuracy and content compatibility; lower numbers are better for perplexity.\n\nModel | Attribute | Content | Fluency\nCtrl-gen [18] | 66.0% | 6.94% | 2.51\nCross-align [22] | 91.2% | 22.04% | 2.54\nOurs | 92.8% | 55.10% | 3.19\n\nTable 2: Human assessment of sentences generated by the models. Attribute and content scores indicate the percentage of sentences judged by humans to have the appropriate attribute label and content, respectively. Fluency scores were obtained on a 5-point Likert scale.\n\nSupervision | Model | BLEU\nPaired data | Seq2seq | 10.4\nPaired data | Seq2seq-bi | 11.15\nUnpaired data | Ours | 7.65\nPaired + Unpaired data | Ours | 13.89\n\nTable 3: Translating Old English to Modern English. The seq2seq models are supervised with parallel data. We consider our model in the unsupervised (no parallel data) and semi-supervised (paired and unpaired data) settings.\n\nFluency: We use a pre-trained language model to measure the fluency of generated sentences. The perplexity of generated sentences, as evaluated by the language model, is treated as a measure of fluency. A state-of-the-art language model trained on the Billion Words benchmark dataset [32] is used for the evaluation.\n\n4.3 Sentiment Experiments\n\nData: We use the restaurant reviews dataset from [22], a filtered version of the Yelp reviews dataset. Similar to [18], we use the IMDB movie review corpus from [33]. We use Shen et al. [22]'s filtering process to construct a subset of the data for training and testing.
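The round-trip content metric f_content of Eq. (7) can be sketched with a toy unigram BLEU; both the BLEU-1 implementation and the identity "translation" models below are simplifications for illustration:

```python
def bleu1(reference, hypothesis):
    """Toy clipped unigram precision (no brevity penalty), a stand-in
    for the BLEU scores used in Eq. (7)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    matches = sum(min(hyp.count(w), ref.count(w)) for w in set(hyp))
    return matches / len(hyp)

def f_content(src_sents, tgt_sents, m, m_prime):
    """0.5 * [avg BLEU(x, m'(m(x))) over D_src + avg BLEU(x, m(m'(x))) over D_tgt]."""
    fwd = sum(bleu1(x, m_prime(m(x))) for x in src_sents) / len(src_sents)
    bwd = sum(bleu1(x, m(m_prime(x))) for x in tgt_sents) / len(tgt_sents)
    return 0.5 * (fwd + bwd)

# Identity "translators" give a perfect round-trip score.
identity = lambda s: s
print(f_content(["the food was great ."], ["the food was bad ."], identity, identity))  # -> 1.0
```

In the paper's setup, `m` and `m_prime` are the same generative model conditioned on positive and negative sentiment, and the score penalizes vacuous mappings that discard the input's content.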
The datasets\nrespectively have 447k, 300k training sentences and 128k, 36k test sentences.\n\nWe compare our model against Ctrl-gen, the VAE model of Hu et al. [18] and Cross-align, the cross\nalignment model of Shen et al. [22]. Code obtained from the authors is used to train models on the\ndatasets. We use a pre-trained model provided by [18] for movie review experiments.\n\nQuantitative evaluation Table 1 compares our model against prior work in terms of the objective\nmetrics discussed in the previous section. Both [18, 22] perform soft-decoding, so that back-\npropagation through the sampling process is made possible. But this leads to artifacts in generation,\nproducing low \ufb02uency scores. Note that the \ufb02uency scores do not represent the perplexity of the\ngenerators, but perplexity measured on generated sentences using a pre-trained language model.\nWhile the absolute numbers may not be representative of the generation quality, it serves as a useful\nmeasure for relative comparison.\n\nWe report BLEU-1 and BLEU-4 scores for the content metric. Back-translation has been effectively\nused for data augmentation in unsupervised translation approaches. The interpolation loss can be\nthought of as data augmentation in the feature space, taking into account the noisy nature of parallel\ntext produced by the model, and encourages content preservation when modifying attribute properties.\nThe cross-align model performs strongly in terms of attribute accuracy, however it has dif\ufb01culties\ngenerating grammatical text. Our model is able to outperform these methods in terms of all metrics.\n\nQualitative evaluation Table 4 shows samples generated from the models for given conditioning\nsentence and sentiment label. For each query sentence, we generate a sentence conditioned on\nthe opposite label. The Ctrl-gen model rarely produces content compatible sentences. Cross-\nalign produces relevant sentences, while parts of the sentence are ungrammatical. 
Our model generates sentences that are more related to the input sentence. More examples can be found in the supplementary material.\n\nRestaurant reviews, negative -> positive\nQuery: the people behind the counter were not friendly whatsoever .\nCtrl-gen [18]: the food did n't taste as fresh as it could have been either .\nCross-align [22]: the owners are the staff is so friendly .\nOurs: the people at the counter were very friendly and helpful .\n\nRestaurant reviews, positive -> negative\nQuery: they do an exceptional job here , the entire staff is professional and accommodating !\nCtrl-gen [18]: very little water just boring ruined !\nCross-align [22]: they do not be back here , the service is so rude and do n't care !\nOurs: they do not care about customer service , the staff is rude and unprofessional !\n\nMovie reviews, negative -> positive\nQuery: once again , in this short , there isn't much plot .\nCtrl-gen [18]: it's perfectly executed with some idiotically amazing directing .\nCross-align [22]: but <unk> , , the film is so good , it is .\nOurs: first off , in this film , there is nothing more interesting .\n\nMovie reviews, positive -> negative\nQuery: that's another interesting aspect about the film .\nCtrl-gen [18]: peter was an ordinary guy and had problems we all could <unk> with\nCross-align [22]: it's the <unk> and the plot .\nOurs: there's no redeeming qualities about the film .\n\nTable 4: Query sentences modified with opposite sentiment by Ctrl-gen [18], Cross-align [22] and our model, respectively.\n\nHuman evaluation: We supplement the quantitative and qualitative evaluations with human assessments of generated sentences.
Human judges on MTurk were asked to rate the three aspects of\ngenerated sentences we are interested in - attribute compatibility, content preservation and \ufb02uency.\nWe chose 100 sentences from the test set randomly and generated corresponding sentences with the\nsame content and opposite sentiment. Attribute compatibility is assessed by asking judges to label\ngenerated sentences and comparing the opinions with the actual conditioning sentiment label. For\ncontent assessment, we ask judges whether the original and generated sentences are related by the\ndesired property (same semantic content and opposite sentiment). Fluency/grammaticality ratings\nwere obtained on a 5-point Likert scale. More details about the evaluation setup are provided in\nsection B of the appendix. Results are presented in Table 2. These ratings are in agreement with\nthe objective evaluations and indicate that samples from our model are more realistic and re\ufb02ect the\nconditioning information better than previous methods.\n\n4.4 Monolingual Translation\n\nWe next consider a style transfer experiment where we attempt to emulate a particular writing style.\nThis has been traditionally formulated as a monolingual translation problem where aligned data from\ntwo styles are used to train translation models. We consider English texts written in old English and\naddress the problem of translating between old and modern English. We used a dataset of Shakespeare\nplays crawled from the web [9]. A subset of the data has alignments between the two writing styles.\nThe aligned data was split as 17k pairs for training and 2k, 1k pairs respectively for development and\ntest. All remaining 80k sentences are considered unpaired.\n\nWe consider two sequence to sequence models as baselines. The \ufb01rst one is a simple sequence to\nsequence model that is trained to translate old to modern English. 
The second variation learns to translate both ways, where the decoder takes the domain of the target sentence as an additional input. We compare the performance of the models in Table 3. In addition to the unsupervised setting, which doesn't use any parallel data, we also train our model in the semi-supervised setting: we first train the model using supervised sequence-to-sequence learning and fine-tune on the unpaired data using our objective. The version of our model that does not use any aligned data falls short of the supervised models. However, in the semi-supervised setting we observe an improvement of more than 2 BLEU points over the purely supervised baselines. This shows that the model is capable of finding sentence alignments by exploiting the unlabelled data.\n\nQuery: john was born in the camp\n\nMood | Tense | Voice | Neg. | Generated sentence\nIndicative | Past | Passive | No | john was born in the camp .\nIndicative | Past | Passive | Yes | john wasn't born in the camp .\nIndicative | Past | Active | No | john had lived in the camp .\nIndicative | Past | Active | Yes | john didn't live in the camp .\nIndicative | Present | Passive | No | john is born in the camp .\nIndicative | Present | Passive | Yes | john isn't born in the camp .\nIndicative | Present | Active | No | john has lived in the camp .\nIndicative | Present | Active | Yes | john doesn't live in the camp .\nIndicative | Future | Passive | No | john will be born in the camp .\nIndicative | Future | Passive | Yes | john will not be born in the camp .\nIndicative | Future | Active | No | john will live in the camp .\nIndicative | Future | Active | Yes | john will not survive in the camp .\nSubjunctive | Cond | Passive | No | john could be born in the camp .\nSubjunctive | Cond | Passive | Yes | john couldn't live in the camp .\nSubjunctive | Cond | Active | No | john could live in the camp .\nSubjunctive | Cond | Active | Yes | john couldn't live in the camp .\n\nTable 5: Simultaneous control of multiple
attributes. Generated sentences for all valid combinations of the input attribute values.

4.5 Ablative study

Figure 3 shows an ablative study of the different loss components of the model. Each point in the plot represents the performance of a model (on the validation set) during training, where we plot attribute compatibility against content compatibility. As training progresses, models move to the right. Models at the top right are desirable (high attribute and content compatibility). Lae and Lint refer to models trained with only the auto-encoding loss or the interpolated loss respectively. We observe that the interpolated reconstruction loss by itself produces a reasonable model: it augments the data with generated samples and acts as a regularizer. Adding the adversarial loss Ladv to each of the above losses improves attribute compatibility, since it explicitly requires generated sequences to be label compatible (and realistic). We also consider Lae + Lbt + Ladv as a control experiment. While this model performs strongly, it suffers from the issues associated with Lae and Lbt discussed in section 3.2. The attribute compatibility of the proposed model Lint + Ladv drops more gracefully than that of the other settings as content preservation improves.

[Figure 3: For the objectives Lae, Lint, Lae + Ladv, Lae + Lbt + Ladv and Lint + Ladv, we plot attribute compatibility (accuracy, %) against content compatibility (BLEU) of the learned model as training progresses. Models at the top right are desirable (high compatibility along both dimensions).]

4.6 Simultaneous control of multiple attributes

In this section we discuss experiments on simultaneously controlling multiple attributes of the input sentence.
Given a set of sentences annotated with multiple attributes, our goal is to be able to plug this data into the learning algorithm and obtain a model capable of tweaking these properties of a sentence. Towards this end, we consider the following four attributes: tense, voice, mood and negation. We use an annotation tool [34] to annotate a large corpus of sentences. We do not make fine distinctions such as progressive and perfect tenses, and collapse them into a single category. We used a subset of ~2M sentences from the BookCorpus dataset [15], chosen to have approximately balanced classes across the different attributes.

Table 5 shows generated sentences conditioned on all valid combinations of attribute values for a given query sentence. We use the annotation tool to assess attribute compatibility of generated sentences. Attribute accuracies measured on generated sentences for mood, tense, voice and negation were 98%, 98%, 90% and 97% respectively. The voice attribute is more difficult to control than the other attributes, since some sentences require global changes such as switching the subject-verb-object order, and we found that the model tends to distort the content during voice control.

5 Conclusion

In this work we considered the problem of modifying textual attributes of sentences. We proposed a model that explicitly encourages content preservation, attribute compatibility and realistic sequence generation through carefully designed reconstruction and adversarial losses. We demonstrated that our model effectively reflects the conditioning information through various experiments and metrics. While previous work has centered on controlling a single attribute and transferring between two styles, the proposed model easily extends to the multiple attribute scenario.
It would be interesting future work to consider attributes with continuous values in this framework and a much larger set of semantic and syntactic attributes.

Acknowledgements We thank Andrew Dai, Quoc Le, Xinchen Yan and Ruben Villegas for helpful discussions. We also thank Jongwook Choi, Junhyuk Oh, Kibok Lee, Seunghoon Hong, Sungryull Sohn, Yijie Guo, Yunseok Jang and Yuting Zhang for helpful feedback on the manuscript.

References

[1] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[2] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[3] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

[4] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. Attribute2image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pages 776–791. Springer, 2016.

[5] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.

[6] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.

[7] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.

[8] Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155, 2016.

[9] Wei Xu, Alan Ritter, William B Dolan, Ralph Grishman, and Colin Cherry. Paraphrasing for style.
In 24th International Conference on Computational Linguistics, COLING 2012, 2012.

[10] Rico Sennrich, Barry Haddow, and Alexandra Birch. Controlling politeness in neural machine translation via side constraints. In Proceedings of NAACL-HLT, pages 35–40, 2016.

[11] Rakshith Shetty, Bernt Schiele, and Mario Fritz. Author attribute anonymity by adversarial training of neural machine translation. arXiv preprint arXiv:1711.01921, 2017.

[12] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

[13] Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. Controlling output length in neural encoder-decoders. arXiv preprint arXiv:1609.09552, 2016.

[14] Hayahide Yamagishi, Shin Kanouchi, Takayuki Sato, and Mamoru Komachi. Controlling the voice of a sentence in Japanese-to-English neural machine translation. In Proceedings of the 3rd Workshop on Asian Translation (WAT2016), pages 203–210, 2016.

[15] Ryan Kiros, Richard Zemel, and Ruslan R Salakhutdinov. A multiplicative model for learning distributed text-based attribute representations. In Advances in Neural Information Processing Systems, pages 2348–2356, 2014.

[16] Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444, 2017.

[17] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8): 1735–1780, 1997.

[18] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. Controllable text generation. arXiv preprint arXiv:1703.00955, 2017.

[19] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space.
arXiv preprint arXiv:1511.06349, 2015.

[20] Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.

[21] Juncen Li, Robin Jia, He He, and Percy Liang. Delete, retrieve, generate: A simple approach to sentiment and style transfer. arXiv preprint arXiv:1804.06437, 2018.

[22] Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. Style transfer from non-parallel text by cross-alignment. arXiv preprint arXiv:1705.09655, 2017.

[23] Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W Black. Style transfer through back-translation. arXiv preprint arXiv:1804.09000, 2018.

[24] Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828, 2016.

[25] Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041, 2017.

[26] Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043, 2017.

[27] Takeru Miyato and Masanori Koyama. cGANs with projection discriminator. arXiv preprint arXiv:1802.05637, 2018.

[28] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated feedback recurrent neural networks. In International Conference on Machine Learning, pages 2067–2075, 2015.

[29] Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.

[30] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis.
In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1015.

[31] Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. Style transfer in text: Exploration and evaluation. arXiv preprint arXiv:1711.06861, 2017.

[32] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.

[33] Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alexander J Smola, Jing Jiang, and Chong Wang. Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 193–202. ACM, 2014.

[34] Anita Ramm, Sharid Loáiciga, Annemarie Friedrich, and Alexander Fraser. Annotating tense, mood and voice for English, French and German. In Proceedings of ACL 2017, System Demonstrations, pages 1–6, 2017.