{"title": "Unsupervised Text Style Transfer using Language Models as Discriminators", "book": "Advances in Neural Information Processing Systems", "page_first": 7287, "page_last": 7298, "abstract": "Binary classifiers are employed as discriminators in GAN-based unsupervised style transfer models to ensure that transferred sentences are similar to sentences in the target domain. One difficulty with the binary discriminator is that error signal is sometimes insufficient to train the model to produce rich-structured language. In this paper, we propose a technique of using a target domain language model as the discriminator to provide richer, token-level feedback during the learning process. Because our language model scores sentences directly using a product of locally normalized probabilities, it offers more stable and more useful training signal to the generator. We train the generator to minimize the negative log likelihood (NLL) of generated sentences evaluated by a language model. By using continuous approximation of the discrete samples, our model can be trained using back-propagation in an end-to-end way. Moreover, we find empirically with a language model as a structured discriminator, it is possible to eliminate the adversarial training steps using negative samples, thus making training more stable. We compare our model with previous work using convolutional neural networks (CNNs) as discriminators and show our model outperforms them significantly in three tasks including word substitution decipherment, sentiment modification and related language translation.", "full_text": "Unsupervised Text Style Transfer using Language\n\nModels as Discriminators\n\nZichao Yang1, Zhiting Hu1, Chris Dyer2, Eric P. 
Xing1, Taylor Berg-Kirkpatrick1\n\n1Carnegie Mellon University, 2DeepMind\n\n{zichaoy, zhitingh, epxing, tberg}@cs.cmu.edu\n\ncdyer@google.com\n\nAbstract\n\nBinary classi\ufb01ers are often employed as discriminators in GAN-based unsupervised\nstyle transfer systems to ensure that transferred sentences are similar to sentences\nin the target domain. One dif\ufb01culty with this approach is that the error signal\nprovided by the discriminator can be unstable and is sometimes insuf\ufb01cient to train\nthe generator to produce \ufb02uent language. In this paper, we propose a new technique\nthat uses a target domain language model as the discriminator, providing richer\nand more stable token-level feedback during the learning process. We train the\ngenerator to minimize the negative log likelihood (NLL) of generated sentences,\nevaluated by the language model. By using a continuous approximation of discrete\nsampling under the generator, our model can be trained using back-propagation\nin an end-to-end fashion. Moreover, our empirical results show that when using\na language model as a structured discriminator, it is possible to forgo adversarial\nsteps during training, making the process more stable. We compare our model\nwith previous work that uses convolutional networks (CNNs) as discriminators, as\nwell as a broad set of other approaches. Results show that the proposed method\nachieves improved performance on three tasks: word substitution decipherment,\nsentiment modi\ufb01cation, and related language translation.\n\n1\n\nIntroduction\n\nRecently there has been growing interest in designing natural language generation (NLG) systems\nthat allow for control over various attributes of generated text \u2013 for example, sentiment and other\nstylistic properties. Such controllable NLG models have wide applications in dialogues systems (Wen\net al., 2016) and other natural language interfaces. 
Recent successes for neural text generation\nmodels in machine translation (Bahdanau et al., 2014), image captioning (Vinyals et al., 2015) and\ndialogue (Vinyals and Le, 2015; Wen et al., 2016) have relied on massive parallel data. However,\nfor many other domains, only non-parallel data \u2013 which includes collections of sentences from each\ndomain without explicit correspondence \u2013 is available. Many text style transfer problems fall into this\ncategory. The goal for these tasks is to transfer a sentence with one attribute to a sentence with an\nanother attribute, but with the same style-independent content, trained using only non-parallel data.\n\nUnsupervised text style transfer requires learning disentangled representations of attributes (e.g., nega-\ntive/positive sentiment, plaintext/ciphertext orthography) and underlying content. This is challenging\nbecause the two interact in subtle ways in natural language and it can even be hard to disentangle them\nwith parallel data. The recent development of deep generative models like variational auto-encoders\n(VAEs) (Kingma and Welling, 2013) and generative adversarial networks(GANs) (Goodfellow et al.,\n2014) have made learning disentangled representations from non-parallel data possible. However,\ndespite their rapid progress in computer vision\u2014for example, generating photo-realistic images (Rad-\nford et al., 2015), learning interpretable representations (Chen et al., 2016b), and translating im-\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fages (Zhu et al., 2017)\u2014their progress on text has been more limited. For VAEs, the problem of\ntraining collapse can severely limit effectiveness (Bowman et al., 2015; Yang et al., 2017b), and when\napplying adversarial training to natural language, the non-differentiability of discrete word tokens\nmakes generator optimization dif\ufb01cult. 
Hence, most attempts use REINFORCE (Sutton et al., 2000)\nto \ufb01netune trained models (Yu et al., 2017; Li et al., 2017) or uses professor forcing (Lamb et al.,\n2016) to match hidden states of decoders.\n\nPrevious work on unsupervised text style transfer (Hu et al., 2017a; Shen et al., 2017) adopts an\nencoder-decoder architecture with style discriminators to learn disentangled representations. The\nencoder takes a sentence as an input and outputs a style-independent content representation. The\nstyle-dependent decoder takes the content representation and a style representation and generates\nthe transferred sentence. Hu et al. (2017a) use a style classi\ufb01er to directly enforce the desired style\nin the generated text. Shen et al. (2017) leverage an adversarial training scheme where a binary\nCNN-based discriminator is used to evaluate whether a transferred sentence is real or fake, ensuring\nthat transferred sentences match real sentences in terms of target style. However, in practice, the\nerror signal from a binary classi\ufb01er is sometimes insuf\ufb01cient to train the generator to produce \ufb02uent\nlanguage, and optimization can be unstable as a result of the adversarial training step.\n\nWe propose to use an implicitly trained language model as a new type of discriminator, replacing the\nmore conventional binary classi\ufb01er. The language model calculates a sentence\u2019s likelihood, which\ndecomposes into a product of token-level conditional probabilities. In our approach, rather than\ntraining a binary classi\ufb01er to distinguish real and fake sentences, we train the language model to\nassign a high probability to real sentences and train the generator to produce sentences with high\nprobability under the language model. Because the language model scores sentences directly using a\nproduct of locally normalized probabilities, it may offer more stable and more useful training signal to\nthe generator. 
Further, by using a continuous approximation of discrete sampling under the generator,\nour model can be trained using back-propagation in an end-to-end fashion.\n\nWe \ufb01nd empirically that when using the language model as a structured discriminator, it is possible to\neliminate adversarial training steps that use negative samples\u2014a critical part of traditional adversarial\ntraining. Language models are implicitly trained to assign a low probability to negative samples\nbecause of its normalization constant. By eliminating the adversarial training step, we found the\ntraining becomes more stable in practice.\n\nTo demonstrate the effectiveness of our new approach, we conduct experiments on three tasks: word\nsubstitution decipherment, sentiment modi\ufb01cation, and related language translation. We show that\nour approach, which uses only a language model as the discriminator, outperforms a broad set of\nstate-of-the-art approaches on the three tasks.\n\n2 Unsupervised Text Style Transfer\n\nWe start by reviewing the current approaches for unsupervised text style transfer (Hu et al., 2017a;\nShen et al., 2017), and then go on to describe our approach in Section 3. Assume we have two text\ndatasets X = {x(1), x(2), . . . , x(m)} and Y = {y(1), y(2), . . . , y(n)} with two different styles vx\nand vy, respectively. For example, vx can be the positive sentiment style and vy can be the negative\nsentiment style. The datasets are non-parallel such that the data does not contain pairs of (x(i), y(j))\nthat describe the same content. The goal of style transfer is to transfer data x with style vx to style\nvy and vice versa, i.e., to estimate the conditional distribution p(y|x) and p(x|y). Since text data\nis discrete, it is hard to learn the transfer function directly via back-propagation as in computer\nvision (Zhu et al., 2017). 
Instead, we assume the data is generated conditioned on two disentangled\nparts, the style v and the content z (Hu et al., 2017a); we drop the subscript in notation wherever the meaning is clear.\n\nConsider the following generative process for each style: 1) the style representation v is sampled\nfrom a prior p(v); 2) the content vector z is sampled from p(z); 3) the sentence x is generated from\nthe conditional distribution p(x|z, v). This model suggests the following parametric form for style\ntransfer where q represents a posterior:\n\n$$p(y|x) = \\int_{z_x} p(y|z_x, v_y) \\, q(z_x|x, v_x) \\, dz_x.$$\n\nThe above equation suggests the use of an encoder-decoder framework for style transfer problems.\nWe can first encode the sentence x to get its content vector zx, then we switch the style label from vx\nto vy. Combining the content vector zx and the style label vy, we can generate a new sentence $\\tilde{x}$\n(the transferred sentences are denoted as $\\tilde{x}$ and $\\tilde{y}$).\n\nOne unsupervised approach is to use the auto-encoder model. We first use an encoder model E to\nencode x and y to get the content vectors zx = E(x, vx) and zy = E(y, vy). Then we use a decoder\nG to generate sentences conditioned on z and v. E and G together form an auto-encoder, and\nthe reconstruction loss is:\n\n$$\\mathcal{L}_{rec}(\\theta_E, \\theta_G) = \\mathbb{E}_{x \\sim X}[-\\log p_G(x|z_x, v_x)] + \\mathbb{E}_{y \\sim Y}[-\\log p_G(y|z_y, v_y)],$$\n\nwhere vx and vy can be two learnable vectors to represent the label embedding. In order to make\nsure that zx and zy capture the content and that we can deliver accurate transfer between the styles\nby switching the labels, we need to guarantee that zx and zy follow the same distribution. We can\nassume p(z) follows a prior distribution and add a KL-divergence regularization on zx, zy. The\nmodel then becomes a VAE. 
However, previous works (Bowman et al., 2015; Yang et al., 2017b)\nfound that there is a training collapse problem with the VAE for text modeling and the posterior\ndistribution of z fails to capture the content of a sentence.\n\nTo better capture the desired styles in the generated sentences, Hu et al. (2017a) additionally impose\na style classifier on the generated samples, and the decoder G is trained to generate sentences that\nmaximize the accuracy of the style classifier. Such additional supervision with a discriminative model\nis also adopted in (Shen et al., 2017), though in that work a binary real/fake classifier is instead used\nwithin a conventional adversarial scheme.\n\nAdversarial Training: Shen et al. (2017) use adversarial training to align the z distributions. Not\nonly do we want to align the distributions of zx and zy, but we also want the transferred sentence\n$\\tilde{x}$ from x to resemble y and vice versa. Several adversarial discriminators are introduced to align\nthese distributions. Each of the discriminators is a binary classifier distinguishing between real and\nfake. Specifically, the discriminator Dz aims to distinguish between zx and zy:\n\n$$\\mathcal{L}^z_{adv}(\\theta_E, \\theta_{D_z}) = \\mathbb{E}_{x \\sim X}[-\\log D_z(z_x)] + \\mathbb{E}_{y \\sim Y}[-\\log(1 - D_z(z_y))].$$\n\nSimilarly, Dx distinguishes between x and $\\tilde{y}$, yielding an objective $\\mathcal{L}^x_{adv}$ as above; and Dy distinguishes\nbetween y and $\\tilde{x}$, yielding $\\mathcal{L}^y_{adv}$. Since the samples of $\\tilde{x}$ and $\\tilde{y}$ are discrete and it is hard to train the\ngenerator in an end-to-end way, professor forcing (Lamb et al., 2016) is used to match the distributions\nof the hidden states of decoders. The overall training objective is a min-max game played among the\nencoder E/decoder G and the discriminators Dz, Dx, Dy (Goodfellow et al., 2014):\n\n$$\\min_{E,G} \\max_{D_z, D_x, D_y} \\mathcal{L}_{rec} - \\lambda(\\mathcal{L}^z_{adv} + \\mathcal{L}^x_{adv} + \\mathcal{L}^y_{adv})$$\n\nThe model is trained in an alternating manner. 
In the first step, the loss of the discriminators is\nminimized to distinguish between zx, x, y and zy, $\\tilde{x}$, $\\tilde{y}$, respectively; and in the second step the\nencoder and decoder are trained to minimize the reconstruction loss while maximizing the loss of the\ndiscriminators.\n\n3 Language Models as Discriminators\n\nIn most past work, a classifier is used as the discriminator to distinguish whether a sentence is real or\nfake. We propose instead to use locally-normalized language models as discriminators. We argue\nthat using an explicit language model with token-level locally normalized probabilities offers a more\ndirect training signal to the generator. If a transferred sentence does not match the target style, it will\nhave high perplexity when evaluated by a language model that was trained on target domain data.\nNot only does it provide an overall evaluation score for the whole sentence, but a language model can\nalso assign a probability to each token, thus providing more information on which word is to blame if\nthe overall perplexity is very high.\n\nThe overall model architecture is shown in Figure 1. Suppose $\\tilde{x}$ is the output sentence from applying\nstyle transfer to input sentence x, i.e., $\\tilde{x}$ is sampled from $p_G(\\tilde{x}|z_x, v_y)$ (and similarly for $\\tilde{y}$ and y). Let\npLM(x) be the probability of a sentence x evaluated by a language model; then the discriminator\nloss becomes:\n\n$$\\mathcal{L}^x_{LM}(\\theta_E, \\theta_G, \\theta_{LM_x}) = \\mathbb{E}_{x \\sim X}[-\\log p_{LM_x}(x)] + \\gamma \\, \\mathbb{E}_{y \\sim Y, \\tilde{y} \\sim p_G(\\tilde{y}|z_y, v_x)}[\\log p_{LM_x}(\\tilde{y})], \\tag{1}$$\n\n$$\\mathcal{L}^y_{LM}(\\theta_E, \\theta_G, \\theta_{LM_y}) = \\mathbb{E}_{y \\sim Y}[-\\log p_{LM_y}(y)] + \\gamma \\, \\mathbb{E}_{x \\sim X, \\tilde{x} \\sim p_G(\\tilde{x}|z_x, v_y)}[\\log p_{LM_y}(\\tilde{x})]. \\tag{2}$$\n\nOur overall objective becomes:\n\n$$\\min_{E,G} \\max_{LM_x, LM_y} \\mathcal{L}_{rec} - \\lambda(\\mathcal{L}^x_{LM} + \\mathcal{L}^y_{LM}) \\tag{3}$$\n\nFigure 1: The overall model architecture consists of two parts: reconstruction and transfer. For\ntransfer, we switch the style label and sample an output sentence from the generator that is evaluated\nby a language model.\n\nNegative samples: Note that Equations 1 and 2 differ from traditional ways of training language\nmodels in that we have a term including the negative samples. We train the LM in an adversarial way\nby minimizing the LM loss on real sentences and maximizing the loss on transferred sentences.\nHowever, since the LM is a structured discriminator, we would hope that a language model trained\non the real sentences will automatically assign high perplexity to sentences not in the target domain,\nhence negative samples from the generator may not be necessary. To investigate the necessity of\nnegative samples, we add a weight \u03b3 to the loss of negative samples. The weight \u03b3 adjusts the\nnegative sample loss in training the language models. If \u03b3 = 0, we simply train the language model\non real sentences and fix its parameters, avoiding potentially unstable adversarial training steps. We\ninvestigate the necessity of using negative samples in the experiment section.\n\nTraining alternates between two steps. In the first step, we train the language models according\nto Equations 1 and 2. In the second step, we minimize the reconstruction loss as well as the\nperplexity of generated samples evaluated by the language model. 
Since $\\tilde{x}$ is discrete, one can use the\nREINFORCE (Sutton et al., 2000) algorithm to train the generator:\n\n$$\\nabla_{\\theta_G} \\mathcal{L}^y_{LM} = \\mathbb{E}_{x \\sim X, \\tilde{x} \\sim p_G(\\tilde{x}|z_x, v_y)}[\\log p_{LM}(\\tilde{x}) \\nabla_{\\theta_G} \\log p_G(\\tilde{x}|z_x, v_y)]. \\tag{4}$$\n\nHowever, using a single sample to approximate the expected gradient leads to high variance in\ngradient estimates and thus unstable learning.\n\nContinuous approximation: Instead, we propose to use a continuous approximation to the sampling\nprocess in training the generator, as demonstrated in Figure 2. Rather than feeding a single sampled\nword as input to the next timestep of the generator, we use a Gumbel-softmax (Jang et al., 2016)\ndistribution as a continuous approximation to the sample. Let u be a categorical distribution with\nprobabilities $\\pi_1, \\pi_2, \\ldots, \\pi_c$. Samples from u can be approximated using:\n\n$$p_i = \\frac{\\exp((\\log \\pi_i + g_i)/\\tau)}{\\sum_{j=1}^{c} \\exp((\\log \\pi_j + g_j)/\\tau)},$$\n\nwhere the $g_i$\u2019s are independent samples from Gumbel(0, 1).\nLet the tokens of the transferred sentence be $\\tilde{x} = \\{\\tilde{x}_t\\}_{t=1}^{T}$. Suppose the output of the logit at timestep\nt is $v^x_t$, then $\\tilde{p}^x_t = \\text{Gumbel-softmax}(v^x_t, \\tau)$, where $\\tau$ is the temperature. When $\\tau \\to 0$, $\\tilde{p}^x_t$ becomes\nthe one-hot representation of token $\\tilde{x}_t$. Using the continuous approximation, the output of the\ndecoder becomes a sequence of probability vectors $\\tilde{p}^x = \\{\\tilde{p}^x_t\\}_{t=1}^{T}$.\n\nFigure 2: Continuous approximation of language model loss. The input is a sequence of probability\ndistributions $\\{\\tilde{p}^x_t\\}_{t=1}^{T}$ sampled from the generator. At each timestep, we compute a weighted\nembedding as input to the language model and get the sequence of output distributions from the LM\nas $\\{\\hat{p}^x_t\\}_{t=1}^{T}$. The loss is the sum of cross entropies between each pair of $\\tilde{p}^x_t$ and $\\hat{p}^x_t$.\n\nWith the continuous approximation of $\\tilde{x}$, we can calculate the loss evaluated using a language model\neasily, as shown in Figure 2. For every step, we feed $\\tilde{p}^x_t$ to the language model of y (denoted as\nLMy) using the weighted average of the embedding $W_e \\tilde{p}^x_t$, then we get the output from the LMy\nwhich is a probability distribution over the vocabulary of the next word $\\hat{p}^x_{t+1}$. The loss of the current\nstep is the cross entropy loss between $\\tilde{p}^x_{t+1}$ and $\\hat{p}^x_{t+1}$: $(\\tilde{p}^x_{t+1})^{\\top} \\log \\hat{p}^x_{t+1}$. Note that when the decoder\noutput distribution $\\tilde{p}^x_{t+1}$ aligns with the language model output distribution $\\hat{p}^x_{t+1}$, the above loss\nachieves minimum. By summing the loss over all steps and taking the gradient, we can use standard\nback-propagation to train the generator:\n\n$$\\nabla_{\\theta_G} \\mathcal{L}^y_{LM} \\approx \\mathbb{E}_{x \\sim X, \\tilde{p}^x \\sim p_G(\\tilde{x}|z_x, v_y)}\\Big[\\nabla_{\\theta_G} \\sum_{t=1}^{T} (\\tilde{p}^x_t)^{\\top} \\log \\hat{p}^x_t\\Big]. \\tag{5}$$\n\nThe above equation is a continuous approximation of Equation 4 with the Gumbel-softmax distribution.\nIn experiments, we use a single sample of $\\tilde{p}^x$ to approximate the expectation.\n\nNote that the use of the language model discriminator is somewhat different in each of the two\ntypes of training update steps because of the continuous approximation. We use discrete samples\nfrom the generators as negative samples in training the language model discriminator step, while we\nuse a continuous approximation in updating the generator step according to Equation 5.\n\nOvercoming mode collapse: It is known that in adversarial training, the generator can suffer from\nmode collapse (Arjovsky and Bottou, 2017; Hu et al., 2017b) where the samples from the generator\nonly cover part of the data distribution. 
In preliminary experimentation, we found that the language\nmodel prefers short sentences. To overcome this length bias, we use two tricks in our experiments: 1)\nwe normalize the loss of Equation 5 by length and 2) we fix the length of $\\tilde{x}$ to be the same as that of x. We\nfind these two tricks stabilize the training and avoid generating collapsed overly short outputs.\n\n4 Experiments\n\nIn order to verify the effectiveness of our model, we experiment on three tasks: word substitution\ndecipherment, sentiment modification, and related language translation. We mainly compare with the\nmost comparable approach of (Shen et al., 2017) that uses CNN classifiers as discriminators (we use the code from https://github.com/shentianxiao/language-style-transfer). Note\nthat Shen et al. (2017) use three discriminators to align both z and decoder hidden states, while our\nmodel only uses a single language model as a discriminator directly on the output sentences $\\tilde{x}$, $\\tilde{y}$.\nMoreover, we also compare with a broader set of related work (Hu et al., 2017a; Fu et al., 2017; Li\net al., 2018) for the tasks when appropriate. Our proposed model provides substantial improvements\nin most of the cases. We implement our model with the Texar (Hu et al., 2018b) toolbox based on\nTensorflow (Abadi et al., 2016).\n\nModel | 20% | 40% | 60% | 80% | 100%\nCopy | 64.3 | 39.1 | 14.4 | 2.5 | 0\nShen et al. (2017)\u2217 | 86.6 | 77.1 | 70.1 | 61.2 | 50.8\nOur results:\nLM | 89.0 | 80.0 | 74.1 | 62.9 | 49.3\nLM + adv | 89.1 | 79.6 | 71.8 | 63.8 | 44.2\n\nTable 1: Decipherment results measured in BLEU. Copy is directly measuring y against x. LM +\nadv denotes that we use negative samples to train the language model. \u2217We run the code open-sourced by\nthe authors to get the results.\n\nModel | Accu | BLEU | PPLX | PPLY\nShen et al. (2017) | 79.5 | 12.4 | 50.4 | 52.7\nHu et al. (2017a) | 87.7 | 65.6 | 115.6 | 239.8\nOur results:\nLM | 83.3 | 38.6 | 30.3 | 42.1\nLM + Classifier | 91.2 | 57.8 | 47.0 | 60.9\n\nTable 2: Results for sentiment modification. X = negative, Y = positive. PPLX denotes the\nperplexity of sentences transferred from positive sentences evaluated by a language model trained\nwith negative sentences and vice versa.\n\n4.1 Word substitution decipherment\n\nAs the first task, we consider the word substitution decipherment task previously explored in the NLP\nliterature (Dou and Knight, 2012). We can control the amount of change to the original sentences\nin word substitution decipherment so as to systematically investigate how well the language model\nperforms in a task that requires varying amounts of change. In a word substitution cipher, every token\nin the vocabulary is mapped to a cipher token and the tokens in sentences are replaced with cipher\ntokens according to the cipher dictionary. The task of decipherment is to recover the original text\nwithout any knowledge of the dictionary.\n\nData: Following (Shen et al., 2017), we sample 200K sentences from the Yelp review dataset as plain\ntext X and sample another 200K sentences and apply a word substitution cipher on these sentences to get\nY. We use another 100k parallel sentences as the development and test set respectively. Sentences of\nlength more than 15 are filtered out. We keep all words that appear more than 5 times in the training\nset and get a vocabulary size of about 10k. All words appearing fewer than 5 times are replaced with a\n\u201c<unk>\u201d token. We randomly sample words from the vocabulary and replace them with cipher tokens.\nThe amount of ciphered words ranges from 20% to 100%. As we have ground truth plain text, we\ncan directly measure the BLEU score (footnote 3) to evaluate the model. Our model configurations are included\nin Appendix B.\n\nResults: The results are shown in Table 1. 
We first investigate the effect of using negative samples in\ntraining the language model, denoted by LM + adv in Table 1. We can see that using adversarial\ntraining sometimes improves the results. However, we found empirically that using negative samples\nmakes the training very unstable and the model diverges easily. This is the main reason why we did\nnot get consistently better results by incorporating adversarial training.\n\nComparing with (Shen et al., 2017), we can see that the language model without adversarial training\nis already very effective and performs much better when the amount of change is less than 100%. This\nis intuitive because when the change is less than 100%, a language model can use context information\nto predict and correct enciphered tokens. It\u2019s surprising that even with 100% token change, when all tokens\nare replaced and no context information can be used by the language model, our model\nis only 1.5 BLEU points worse than (Shen et al., 2017). We guess our model can gradually decipher tokens\nfrom the beginning of a sentence and then use them as a bootstrap to decipher the whole sentence.\nWe can also combine language models with the CNNs as discriminators. For example, for the 100%\ncase, we get a BLEU score of 52.1 when combining them. Given the instability of adversarial training and\nthe effectiveness of language models, we set \u03b3 = 0 in Equations 1 and 2 in the rest of the experiments.\n\nFootnote 3: BLEU score is measured with multi-bleu.perl.\n\n4.2 Sentiment Manipulation\n\nWe have demonstrated that the language model can successfully crack word substitution cipher.\nHowever, the change of substitution cipher is limited to a one-to-one mapping. 
As the second task,\nwe would like to investigate whether a language model can distinguish sentences with positive and\nnegative sentiments, and thus help to transfer the sentiment of sentences while preserving the content.\nWe compare to the model of (Hu et al., 2017a) as an additional baseline, which uses a pre-trained\nclassifier as guidance.\n\nData: We use the same data set as in (Shen et al., 2017). The data set contains 250K negative\nsentences (denoted as X) and 380K positive sentences (denoted as Y), of which 70% are used for\ntraining, 10% are used for development and the remaining 20% are used as the test set. The pre-processing\nsteps are the same as in the previous experiment. We also use similar experiment configurations.\n\nEvaluation: Evaluating the quality of transferred sentences is a challenging problem as there are no\nground truth sentences. We follow previous papers in using model-based evaluation. We measure\nwhether transferred sentences have the correct sentiment according to a pre-trained sentiment classifier.\nWe follow both (Hu et al., 2017a) and (Shen et al., 2017) in using a CNN-based classifier. However,\nsimply evaluating the sentiment of sentences is not enough since the model can produce collapsed\noutput such as a single word \u201cgood\u201d for all negative transfer and \u201cbad\u201d for all positive transfer. We\nnot only would like transferred sentences to preserve the content of original sentences, but also to\nbe smooth in terms of language quality. For these two aspects, we propose to measure the BLEU\nscore of transferred sentences against original sentences and measure the perplexity of transferred\nsentences to evaluate the fluency. A good model should perform well on all three metrics.\n\nResults: We report the results in Table 2. As a baseline, the original corpus has perplexity of 35.8\nand 38.8 for the negative and positive sentences respectively. 
Comparing LM with (Shen et al.,\n2017), we can see that LM outperforms it in all three aspects: getting higher accuracy, preserving\nthe content better while being more fluent. This demonstrates the effectiveness of using LM as the\ndiscriminator. (Hu et al., 2017a) has the highest accuracy and BLEU score among the three models\nwhile the perplexity is very high. This is not surprising: the classifier will only modify the features of\nthe sentences that are related to the sentiment and there is no mechanism to ensure that the modified\nsentence is fluent. Hence the corresponding perplexity is very high. We can get the best of\nboth models by combining the loss of LM and the classifier in (Hu et al., 2017a): a classifier is good at\nmodifying the sentiment and an LM can smooth the modification to get a fluent sentence. We find\nimprovement of accuracy and perplexity as denoted by LM + classifier compared to classifier only\n(Hu et al., 2017a).\n\nComparing with other models: Recently there are other models that are proposed specifically\ntargeting the sentiment modification task such as (Li et al., 2018). Their method is feature based and\nconsists of the following steps: (Delete) first, they use the statistics of word frequency to delete the\nattribute words such as \u201cgood, bad\u201d from original sentences, (Retrieve) then they retrieve the most\nsimilar sentences from the other corpus based on nearest neighbor search, (Generate) the attribute\nwords from retrieved sentences are combined with the content words of original sentences to generate\ntransferred sentences. The authors provide 500 human annotated sentences as the ground truth of\ntransferred sentences so we measure the BLEU score against those sentences. The results are shown\nin Table 3. 
We can see our model has similar accuracy compared with DeleteAndRetrieve, but has\nmuch better BLEU scores and slightly better perplexity.\n\nWe list some examples of transferred sentences in Table 5 in the appendix. We can see that (Shen\net al., 2017) does not keep the content of the original sentences well and changes the meaning\nof the original sentences. (Hu et al., 2017a) changes the sentiment but uses improper words, e.g.\n\u201cmaintenance is equally hilarious\u201d. Our LM can change the sentiment of sentences. But\nsometimes there is an over-smoothing problem, changing the less frequent words to more frequent\nwords, e.g. changing \u201cmy goodness it was so gross\u201d to \u201cmy food it was so good\u201d. In general LM +\nclassifier has the best results: it changes the sentiment while keeping the content, and the sentences are\nfluent.\n\nModel | ACCU | BLEU | PPLX | PPLY\nShen et al. (2017) | 76.2 | 6.8 | 49.4 | 45.6\nFu et al. (2017):\nStyleEmbedding | 9.2 | 16.65 | 97.51 | 142.6\nMultiDecoder | 50.9 | 11.24 | 111.1 | 119.1\nLi et al. (2018):\nDelete | 87.2 | 11.5 | 75.2 | 68.7\nTemplate | 86.7 | 18.0 | 192.5 | 148.4\nRetrieval | 95.1 | 1.3 | 31.5 | 37.0\nDeleteAndRetrieval | 90.9 | 12.6 | 104.6 | 43.8\nOur results:\nLM | 85.4 | 13.4 | 32.8 | 40.5\nLM + Classifier | 90.0 | 22.3 | 48.4 | 61.6\n\nTable 3: Results for sentiment modification based on the 500 human annotated sentences as ground\ntruth from (Li et al., 2018).\n\n4.3 Related language translation\n\nIn the final experiment, we consider a more challenging task: unsupervised related language\ntranslation (Pourdamghani and Knight, 2017). Related language translation is easier than general\nlanguage pair translation since there is a close relationship between the two languages. 
Note here we\ndon\u2019t compare with other sophisticated unsupervised neural machine translation systems such as\n(Lample et al., 2017; Artetxe et al., 2017), whose models are much more complicated and use other\ntechniques such as back-translation, but simply compare the different types of discriminators in the\ncontext of a simple model.\n\nData: We choose Bosnian (bs) vs Serbian (sr) and simplified Chinese (zh-CN) vs traditional Chinese\n(zh-TW) pairs as our experiment languages. Due to the lack of parallel data for these language pairs, we build the\ndata ourselves. For the bs and sr pair, we use the monolingual data from Leipzig Corpora Collections (footnote 4).\nWe use the news data and sample about 200k sentences of length less than 20 for each language,\nof which 80% are used for training, 10% are used for validation and the remaining 10% are used for\ntest. For validation and test, we obtain the parallel corpus by using the Google Translation API.\nThe vocabulary size is 25k for the sr vs bs language pair. For the zh-CN and zh-TW pair, we use the\nmonolingual data from the Chinese Gigaword corpus. We use the news headlines as our training data.\n300k sentences are sampled for each language. The data is partitioned and parallel data is obtained in\na similar way to that of the sr vs bs pair. We directly use a character-based model and the total vocabulary\nsize is about 5k. For evaluation, we directly measure the BLEU score using the references for both\nlanguage pairs.\n\nNote that the relationship between zh-CN and zh-TW is simple and mostly like a decipherment\nproblem in which some simplified Chinese characters have a corresponding traditional character\nmapping. The relation between bs and sr is more complicated.\n\nResults: The results are shown in Table 4. For sr\u2013bs and bs\u2013sr, since the vocabularies of the two\nlanguages do not overlap at all, it is a very challenging task. We report the BLEU1 metric since\nthe BLEU4 is close to 0. 
We can see that our language model discriminator still slightly outperforms Shen\net al. (2017). The case for cn\u2013tw and tw\u2013cn is much easier. Simple copying already has a\nreasonable score of 32.3. Using our model, we can improve it to 81.6 for cn\u2013tw and 85.5 for tw\u2013cn,\noutperforming Shen et al. (2017) by a large margin.\n\nModel                 sr\u2013bs   bs\u2013sr   cn\u2013tw   tw\u2013cn\nCopy                  0       0       32.3    32.3\nShen et al. (2017)    29.1    30.3    60.1    60.7\nOur results:\nLM                    31.0    31.7    81.6    85.5\n\nTable 4: Related language translation results measured in BLEU. The results for sr vs bs are measured\nin BLEU1 while those for cn vs tw are measured in BLEU.\n\nFootnote 4: http://wortschatz.uni-leipzig.de/en\n\n5 Related Work\n\nNon-parallel transfer in natural language: (Hu et al., 2017a; Shen et al., 2017; Prabhumoye et al.,\n2018; Gomez et al., 2018) are most relevant to our work. Hu et al. (2017a) aim to generate sentences\nwith controllable attributes by learning disentangled representations. Shen et al. (2017) introduce\nadversarial training to unsupervised text style transfer. They apply discriminators both on the encoder\nrepresentation and on the hidden states of the decoders to ensure that they have the same distribution.\nThese are the two models that we mainly compare with. Prabhumoye et al. (2018) use the\nback-translation technique in their model, which is complementary to our method and can be integrated\ninto our model to further improve performance. Gomez et al. (2018) use a GAN-based approach to\ndecipher shift ciphers. Lample et al. (2017) and Artetxe et al. (2017) propose unsupervised machine\ntranslation and use adversarial training to match the encoder representations of sentences from\ndifferent languages. 
They also use back-translation to refine their model in an iterative way.\n\nGANs: GANs have been widely explored recently, especially in computer vision (Zhu et al., 2017;\nChen et al., 2016b; Radford et al., 2015; Sutton et al., 2000; Salimans et al., 2016; Denton et al., 2015;\nIsola et al., 2017). The progress of GANs on text is relatively limited because discrete tokens are\nnon-differentiable. Many papers (Yu et al., 2017; Che et al., 2017; Li et al., 2017; Yang et al., 2017a)\nuse REINFORCE (Sutton et al., 2000) to fine-tune a trained model to improve the quality of samples.\nThere is also prior work that attempts to introduce more structured discriminators, for instance, the\nenergy-based GAN (EBGAN) (Zhao et al., 2016) and RankGAN (Lin et al., 2017). Our language\nmodel can be seen as a special energy function, but it is more complicated than the auto-encoder\nused in (Zhao et al., 2016) since it has a recurrent structure. Hu et al. (2018a) also propose to\nuse structured discriminators in generative models and establish their connection with posterior\nregularization.\n\nComputer vision style transfer: Our work is also related to unsupervised style transfer in computer\nvision (Gatys et al., 2016; Huang and Belongie, 2017). Gatys et al. (2016) directly use the covariance\nmatrix of the CNN features and try to align the covariance matrices to transfer the style. Huang\nand Belongie (2017) propose adaptive instance normalization for an arbitrary style of images. Zhu\net al. (2017) use a cycle-consistency loss to ensure the content of the images is preserved and can be\ntranslated back to the original images.\n\nLanguage model for reranking: Previously, language models have been used to incorporate the knowledge\nof monolingual data mainly by reranking the sentences generated from a base model, such as in (Brants\net al., 2007; Gulcehre et al., 2015; He et al., 2016). 
(Liu et al., 2017; Chen et al., 2016a) use a\nlanguage model as training supervision for unsupervised OCR. Our model goes further by using\nlanguage models as discriminators to distill the knowledge of monolingual data into a base model in\nan end-to-end way.\n\n6 Conclusion\n\nWe showed that by using language models as discriminators, we could outperform traditional\nbinary classifier discriminators in three unsupervised text style transfer tasks: word substitution\ndecipherment, sentiment modification and related language translation. In comparison with a\nbinary classifier discriminator, a language model can provide a more stable and more informative\ntraining signal for training generators. Moreover, we empirically found that it is possible to eliminate\nadversarial training with negative samples if a structured model is used as the discriminator, thus\npointing to one possible direction for addressing the training difficulty of GANs. In the future, we plan to\nexplore and extend our model to semi-supervised learning.\n\nReferences\n\nM. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving,\nM. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages\n265\u2013283, 2016.\n\nM. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks.\narXiv preprint arXiv:1701.04862, 2017.\n\nM. Artetxe, G. Labaka, E. Agirre, and K. Cho. Unsupervised neural machine translation. arXiv\npreprint arXiv:1710.11041, 2017.\n\nD. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and\ntranslate. arXiv preprint arXiv:1409.0473, 2014.\n\nS. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating sentences\nfrom a continuous space. arXiv preprint arXiv:1511.06349, 2015.\n\nT. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean. 
Large language models in machine translation. In\nProceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing\nand Computational Natural Language Learning (EMNLP-CoNLL), 2007.\n\nT. Che, Y. Li, R. Zhang, R. D. Hjelm, W. Li, Y. Song, and Y. Bengio. Maximum-likelihood augmented\ndiscrete generative adversarial networks. arXiv preprint arXiv:1702.07983, 2017.\n\nJ. Chen, P.-S. Huang, X. He, J. Gao, and L. Deng. Unsupervised learning of predictors from unpaired\ninput-output samples. arXiv preprint arXiv:1606.04646, 2016a.\n\nX. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable\nrepresentation learning by information maximizing generative adversarial nets. In Advances in\nNeural Information Processing Systems, pages 2172\u20132180, 2016b.\n\nJ. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural\nnetworks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.\n\nE. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a Laplacian pyramid\nof adversarial networks. In Advances in Neural Information Processing Systems, pages 1486\u20131494,\n2015.\n\nQ. Dou and K. Knight. Large scale decipherment for out-of-domain machine translation. In\nProceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing\nand Computational Natural Language Learning, pages 266\u2013275. Association for Computational\nLinguistics, 2012.\n\nZ. Fu, X. Tan, N. Peng, D. Zhao, and R. Yan. Style transfer in text: Exploration and evaluation. arXiv\npreprint arXiv:1711.06861, 2017.\n\nL. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In\nComputer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2414\u20132423.\nIEEE, 2016.\n\nA. N. Gomez, S. Huang, I. Zhang, B. M. Li, M. Osama, and L. Kaiser. Unsupervised cipher cracking\nusing discrete GANs. 
arXiv preprint arXiv:1801.04883, 2018.\n\nI. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and\nY. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems,\npages 2672\u20132680, 2014.\n\nC. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin, F. Bougares, H. Schwenk, and Y. Bengio.\nOn using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535,\n2015.\n\nD. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W.-Y. Ma. Dual learning for machine translation.\nIn Advances in Neural Information Processing Systems, pages 820\u2013828, 2016.\n\nZ. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing. Toward controlled generation of text. In\nInternational Conference on Machine Learning, pages 1587\u20131596, 2017a.\n\nZ. Hu, Z. Yang, R. Salakhutdinov, and E. P. Xing. On unifying deep generative models. arXiv\npreprint arXiv:1706.00550, 2017b.\n\nZ. Hu, Z. Yang, R. Salakhutdinov, X. Liang, L. Qin, H. Dong, and E. Xing. Deep generative models\nwith learnable knowledge constraints. arXiv preprint arXiv:1806.09764, 2018a.\n\nZ. Hu, Z. Yang, T. Zhao, H. Shi, J. He, D. Wang, X. Ma, Z. Liu, X. Liang, L. Qin, et al. Texar: A\nmodularized, versatile, and extensible toolbox for text generation. In Proceedings of Workshop for\nNLP Open Source Software (NLP-OSS), pages 13\u201322, 2018b.\n\nX. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization.\nCoRR, abs/1703.06868, 2017.\n\nP. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial\nnetworks. arXiv preprint, 2017.\n\nE. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-softmax. arXiv preprint\narXiv:1611.01144, 2016.\n\nD. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,\n2014.\n\nD. P. Kingma and M. Welling. Auto-encoding variational Bayes. 
arXiv preprint arXiv:1312.6114,\n2013.\n\nA. M. Lamb, A. G. A. P. Goyal, Y. Zhang, S. Zhang, A. C. Courville, and Y. Bengio. Professor\nforcing: A new algorithm for training recurrent networks. In Advances in Neural Information\nProcessing Systems, pages 4601\u20134609, 2016.\n\nG. Lample, L. Denoyer, and M. Ranzato. Unsupervised machine translation using monolingual\ncorpora only. arXiv preprint arXiv:1711.00043, 2017.\n\nJ. Li, W. Monroe, T. Shi, A. Ritter, and D. Jurafsky. Adversarial learning for neural dialogue\ngeneration. arXiv preprint arXiv:1701.06547, 2017.\n\nJ. Li, R. Jia, H. He, and P. Liang. Delete, retrieve, generate: A simple approach to sentiment and\nstyle transfer. arXiv preprint arXiv:1804.06437, 2018.\n\nK. Lin, D. Li, X. He, Z. Zhang, and M.-T. Sun. Adversarial ranking for language generation. In\nAdvances in Neural Information Processing Systems, pages 3155\u20133165, 2017.\n\nY. Liu, J. Chen, and L. Deng. Unsupervised sequence classification using sequential output statistics.\nIn Advances in Neural Information Processing Systems, pages 3550\u20133559, 2017.\n\nN. Pourdamghani and K. Knight. Deciphering related languages. In Proceedings of the 2017\nConference on Empirical Methods in Natural Language Processing, pages 2513\u20132518, 2017.\n\nS. Prabhumoye, Y. Tsvetkov, R. Salakhutdinov, and A. W. Black. Style transfer through\nback-translation. arXiv preprint arXiv:1804.09000, 2018.\n\nA. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional\ngenerative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.\n\nT. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques\nfor training GANs. In Advances in Neural Information Processing Systems, pages 2234\u20132242, 2016.\n\nT. Shen, T. Lei, R. Barzilay, and T. Jaakkola. 
Style transfer from non-parallel text by cross-alignment.\nIn Advances in Neural Information Processing Systems, pages 6833\u20136844, 2017.\n\nR. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement\nlearning with function approximation. In Advances in Neural Information Processing Systems,\npages 1057\u20131063, 2000.\n\nO. Vinyals and Q. Le. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015.\n\nO. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In\nComputer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 3156\u20133164.\nIEEE, 2015.\n\nT.-H. Wen, D. Vandyke, N. Mrksic, M. Gasic, L. M. Rojas-Barahona, P.-H. Su, S. Ultes, and\nS. Young. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint\narXiv:1604.04562, 2016.\n\nZ. Yang, W. Chen, F. Wang, and B. Xu. Improving neural machine translation with conditional\nsequence generative adversarial nets. arXiv preprint arXiv:1703.04887, 2017a.\n\nZ. Yang, Z. Hu, R. Salakhutdinov, and T. Berg-Kirkpatrick. Improved variational autoencoders for\ntext modeling using dilated convolutions. arXiv preprint arXiv:1702.08139, 2017b.\n\nL. Yu, W. Zhang, J. Wang, and Y. Yu. SeqGAN: Sequence generative adversarial nets with policy\ngradient. In AAAI, pages 2852\u20132858, 2017.\n\nJ. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. arXiv preprint\narXiv:1609.03126, 2016.\n\nJ.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using\ncycle-consistent adversarial networks. 
arXiv preprint arXiv:1703.10593, 2017.", "award": [], "sourceid": 3632, "authors": [{"given_name": "Zichao", "family_name": "Yang", "institution": "Carnegie Mellon University"}, {"given_name": "Zhiting", "family_name": "Hu", "institution": "Carnegie Mellon University"}, {"given_name": "Chris", "family_name": "Dyer", "institution": "DeepMind"}, {"given_name": "Eric", "family_name": "Xing", "institution": "Petuum Inc. /  Carnegie Mellon University"}, {"given_name": "Taylor", "family_name": "Berg-Kirkpatrick", "institution": "Carnegie Mellon University"}]}