{"title": "Adversarial Text Generation via Feature-Mover's Distance", "book": "Advances in Neural Information Processing Systems", "page_first": 4666, "page_last": 4677, "abstract": "Generative adversarial networks (GANs) have achieved significant success in generating real-valued data. However, the discrete nature of text hinders the application of GAN to text-generation tasks. Instead of using the standard GAN objective, we propose to improve text-generation GAN via a novel approach inspired by optimal transport. Specifically, we consider matching the latent feature distributions of real and synthetic sentences using a novel metric, termed the feature-mover's distance (FMD). This formulation leads to a highly discriminative critic and easy-to-optimize objective, overcoming the mode-collapsing and brittle-training problems in existing methods. Extensive experiments are conducted on a variety of tasks to evaluate the proposed model empirically, including unconditional text generation, style transfer from non-parallel text, and unsupervised cipher cracking. The proposed model yields superior performance, demonstrating wide applicability and effectiveness.", "full_text": "Adversarial Text Generation via\n\nFeature-Mover\u2019s Distance\n\nLiqun Chen1, Shuyang Dai1, Chenyang Tao1, Dinghan Shen1,\nZhe Gan2, Haichao Zhang4, Yizhe Zhang3, Lawrence Carin1\n\n1Duke University, 2Microsoft Dynamics 365 AI Research, 3Microsoft Research, 4Baidu Research\n\nliqun.chen@duke.edu\n\nAbstract\n\nGenerative adversarial networks (GANs) have achieved signi\ufb01cant success in\ngenerating real-valued data. However, the discrete nature of text hinders the\napplication of GAN to text-generation tasks. Instead of using the standard GAN\nobjective, we propose to improve text-generation GAN via a novel approach\ninspired by optimal transport. 
Specifically, we consider matching the latent feature distributions of real and synthetic sentences using a novel metric, termed the feature-mover's distance (FMD). This formulation leads to a highly discriminative critic and an easy-to-optimize objective, overcoming the mode-collapsing and brittle-training problems in existing methods. Extensive experiments are conducted on a variety of tasks to evaluate the proposed model empirically, including unconditional text generation, style transfer from non-parallel text, and unsupervised cipher cracking. The proposed model yields superior performance, demonstrating wide applicability and effectiveness.\n\n1 Introduction\n\nNatural language generation is an important building block in many applications, such as machine translation [5], dialogue generation [36], and image captioning [14]. While these applications demonstrate the practical value of generating coherent and meaningful sentences in a supervised setup, unsupervised text generation, which aims to estimate the distribution of real text from a corpus, is still challenging. Previous approaches, which often maximize the log-likelihood of each ground-truth word given prior observed words [41], typically suffer from exposure bias [6, 47], i.e., the discrepancy between training and inference stages. During inference, each word is generated in sequence based on previously generated words, while during training ground-truth words are used at each timestep [27, 53, 58].\n\nRecently, adversarial training has emerged as a powerful paradigm to address the aforementioned issues. 
The generative adversarial network (GAN) [21] matches the distribution of synthetic and\nreal data by introducing a two-player adversarial game between a generator and a discriminator.\nThe generator is trained to learn a nonlinear function that maps samples from a given (simple) prior\ndistribution to synthetic data that appear realistic, while the discriminator aims to distinguish the fake\ndata from real samples. GAN can be trained ef\ufb01ciently via back-propagation through the nonlinear\nfunction of the generator, which typically requires the data to be continuous (e.g., images). However,\nthe discrete nature of text renders the model non-differentiable, hindering use of GAN in natural\nlanguage processing tasks.\nAttempts have been made to overcome such dif\ufb01culties, which can be roughly divided into two\ncategories. The \ufb01rst includes models that combine ideas from GAN and reinforcement learning\n(RL), framing text generation as a sequential decision-making process. Speci\ufb01cally, the gradient\nof the generator is estimated via the policy-gradient algorithm. Prominent examples from this\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fcategory include SeqGAN [60], MaliGAN [8], RankGAN [37], LeakGAN [24] and MaskGAN [15].\nDespite the promising performance of these approaches, one major disadvantage with such RL-based\nstrategies is that they typically yield high-variance gradient estimates, known to be challenging for\noptimization [40, 61].\nModels from the second category adopt the original framework of GAN without incorporating\nthe RL methods (i.e., RL-free). Distinct from RL-based approaches, TextGAN [61] and Gumbel-\nSoftmax GAN (GSGAN) [31] apply a simple soft-argmax operator, and a similar Gumbel-softmax\ntrick [28, 40], respectively, to provide a continuous approximation of the discrete distribution (i.e.,\nmultinomial) on text, so that the model is still end-to-end differentiable. 
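As a rough illustration of this continuous relaxation, the soft-argmax can be sketched as a temperature-scaled softmax whose output weights an average over word embeddings; as the temperature shrinks, the weighted vector approaches the embedding of the argmax word. This is a minimal sketch with hypothetical function names, not code from any of the cited models:

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax; as tau -> 0 it approaches a one-hot argmax."""
    scaled = [l / tau for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def soft_argmax_embedding(logits, embedding, tau=0.1):
    """Differentiable surrogate for embedding[argmax(logits)]:
    a probability-weighted average of the word-embedding rows."""
    probs = softmax(logits, tau)
    dim = len(embedding[0])
    return [sum(p * vec[d] for p, vec in zip(probs, embedding)) for d in range(dim)]
```

With a small temperature, the weighted embedding is nearly indistinguishable from the hard argmax choice, yet the operation remains differentiable end-to-end.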
What makes this approach\nappealing is that it feeds the optimizer with low-variance gradients, improving stability and speed of\ntraining. In this work, we aim to improve the training of GAN that resides in this category.\nWhen training GAN to generate text samples, one practical challenge is that the gradient from the\ndiscriminator often vanishes after being trained for only a few iterations. That is, the discriminator\ncan easily distinguish the fake sentences from the real ones. TextGAN [61] proposed a remedy based\non feature matching [49], adding Maximum Mean Discrepancy (MMD) to the original objective of\nGAN [22]. However, in practice, the model is still dif\ufb01cult to train. Speci\ufb01cally, (i) the bandwidth\nof the RBF kernel is dif\ufb01cult to choose; (ii) kernel methods often suffer from poor scaling; and\n(iii) empirically, TextGAN tends to generate short sentences.\nIn this work, we present feature mover GAN (FM-GAN), a novel adversarial approach that leverages\noptimal transport (OT) to construct a new model for text generation. Speci\ufb01cally, OT considers\nthe problem of optimally transporting one set of data points to another, and is closely related to\nGAN. The earth-mover\u2019s distance (EMD) is employed often as a metric for the OT problem. In our\nsetting, a variant of the EMD between the feature distributions of real and synthetic sentences is\nproposed as the new objective, denoted as the feature-mover\u2019s distance (FMD). In this adversarial\ngame, the discriminator aims to maximize the dissimilarity of the feature distributions based on the\nFMD, while the generator is trained to minimize the FMD by synthesizing more-realistic text. In\npractice, the FMD is turned into a differentiable quantity and can be computed using the proximal\npoint method [59].\nThe main contributions of this paper are as follows: (i) A new GAN model based on optimal transport\nis proposed for text generation. 
The proposed model is RL-free, and uses a so-called feature-mover's distance as the objective. (ii) We evaluate our model comprehensively on unconditional text generation. When compared with previous methods, our model shows a substantial improvement in terms of generation quality based on the BLEU statistics [43] and human evaluation. Further, our model also achieves good generation diversity based on the self-BLEU statistics [63]. (iii) In order to demonstrate the versatility of the proposed method, we also generalize our model to conditional-generation tasks, including non-parallel text style transfer [54] and unsupervised cipher cracking [20].\n\n2 Background\n\n2.1 Adversarial training for distribution matching\n\nWe review the basic idea of adversarial distribution matching (ADM), which avoids the specification of a likelihood function. Instead, this strategy defines draws from the synthetic data distribution p_G(x) by drawing a latent code z ~ p(z) from an easily sampled distribution p(z), and learning a generator function G(z) such that x = G(z). The form of p_G(x) is neither specified nor learned; rather, we learn to draw samples from p_G(x). To match the ensemble of draws from p_G(x) with an ensemble of draws from the real data distribution p_d(x), ADM introduces a variational function V(p_d, p_G; D), where D(x) is known as the critic function or discriminator. The goal of ADM is to obtain an equilibrium of the following objective:\n\nmin_G max_D V(p_d, p_G; D) ,    (1)\n\nwhere V(p_d, p_G; D) is computed using samples from p_d and p_G (not explicitly in terms of the distributions themselves), and d(p_d, p_G) = max_D V(p_d, p_G; D) defines a discrepancy metric between two distributions [3, 42]. One popular example of ADM is the generative adversarial network (GAN), in which V_JSD = E_{x~p_d(x)} log D(x) + E_{z~p(z)} log[1 - D(G(z))] recovers the Jensen-Shannon divergence (JSD) for d(p_d, p_G) [21]; the expectations E_{x~p_d(x)}(·) and E_{z~p(z)}(·) are computed approximately with samples from the respective distributions. Most of the existing work applying GAN to text generation also uses this standard form, combining it with policy gradient [60]. However, it has been shown in [2] that this standard GAN objective suffers from an unstable, weak learning signal when the discriminator gets close to a local optimum, due to the gradient-vanishing effect. This is because the JSD implied by the original GAN loss is not continuous w.r.t. the generator parameters.\n\nFigure 1: Illustration of the proposed feature mover GAN (FM-GAN) for text generation (panels: real data, latent space, generated data, embedding matrix, feature space, cost matrix, transport matrix).\n\n2.2 Sentence to feature\n\nGAN models were originally developed for learning to draw from a continuous distribution. The discrete nature of text samples hinders the use of GANs, and thus a vectorization of a sequence of discrete tokens is considered. Let x = {s_1, ..., s_L} ∈ R^{v×L} be a sentence of length L, where s_t ∈ R^v denotes the one-hot representation of the t-th word. A word-level vector representation of each word in x is achieved by learning a word embedding matrix W_e ∈ R^{k×v}, where v is the size of the vocabulary. Each word is represented as a k-dimensional vector w_t = W_e s_t ∈ R^k. The sentence x is now represented as a matrix W = [w_1, ..., w_L] ∈ R^{k×L}. 
A neural network F(·), such as an RNN [5, 10], CNN [18, 29, 52] or SWEM [51], can then be applied to extract a feature vector f = F(W).\n\n2.3 Optimal transport\n\nGAN can be interpreted in the framework of optimal transport theory, and it has been shown that the Earth-Mover's Distance (EMD) is a good objective for generative modeling [3]. Originally applied in content-based image retrieval tasks [48], EMD is well-known for comparing multidimensional distributions that are used to describe the different features of images (e.g., brightness, color, and texture content). It is defined via a ground distance (i.e., cost function) between every two perceptual features, extending the notion of a distance between single elements to a distance between sets of elements. Specifically, consider two probability distributions x ~ μ and y ~ ν; the EMD is then defined as:\n\nD_EMD(μ, ν) = inf_{γ∈Π(μ,ν)} E_{(x,y)~γ} c(x, y) ,    (2)\n\nwhere Π(μ, ν) denotes the set of all joint distributions γ(x, y) with marginals μ(x) and ν(y), and c(x, y) is the cost function (e.g., Euclidean or cosine distance). Intuitively, EMD is the minimum cost incurred in transporting μ to ν under a plan γ.\n\n3 Feature Mover GAN\n\nWe propose a new GAN framework for discrete text data, called feature mover GAN (FM-GAN). The idea of optimal transport (OT) is integrated into adversarial distribution matching. Explicitly, the original critic function in GANs is replaced by the Earth-Mover's Distance (EMD) between the sentence features of real and synthetic data. In addition, to handle the intractability issue when computing (2) [3, 49], we define the Feature-Mover's Distance (FMD), a variant of EMD that can be solved tractably using the Inexact Proximal point method for OT (IPOT) algorithm [59]. 
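To make (2) concrete for empirical distributions: when both point sets have the same size and uniform weights, an optimal transport plan is a hard assignment (a permutation), so the EMD can be computed by brute force for tiny sets. The following is a minimal sketch under those assumptions (function names are ours; practical solvers use linear programming or Sinkhorn-style iterations instead of enumeration):

```python
import itertools
import math

def euclidean(x, y):
    """Ground cost c(x, y): Euclidean distance between two points."""
    return math.dist(x, y)

def emd_uniform(xs, ys, cost=euclidean):
    """Exact EMD between two equally weighted point sets of the same size.
    With uniform marginals and m = n, an optimal plan is a permutation
    matrix, so we can enumerate assignments (feasible for tiny n only)."""
    assert len(xs) == len(ys)
    n = len(xs)
    best = float("inf")
    for perm in itertools.permutations(range(n)):
        total = sum(cost(xs[i], ys[perm[i]]) for i in range(n)) / n
        best = min(best, total)
    return best
```

For example, moving the set {(0,0), (2,0)} onto {(1,0), (3,0)} costs 1.0 on average: the optimal plan matches each point to its nearest target rather than crossing the assignments.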
In the following sections, we discuss the main objective of our model, the detailed training process for text generation, as well as extensions. An illustration of the framework is shown in Figure 1.\n\n3.1 Feature-mover's distance\n\nIn practice, it is not tractable to calculate the minimization over γ in (2) [3, 19, 50]. In this section, we propose the Feature-Mover's Distance (FMD), which can be solved tractably. Consider two sets of sentence feature vectors F = {f_i}_{i=1}^m ∈ R^{d×m} and F' = {f'_j}_{j=1}^n ∈ R^{d×n} drawn from two different sentence feature distributions P_f and P_f'; m and n are the total number of d-dimensional sentence features in F and F', respectively. Let T ∈ R^{m×n} be a transport matrix in which T_ij ≥ 0 defines how much of feature vector f_i would be transported to f'_j. The FMD between two sets of sentence features is then defined as:\n\nD_FMD(P_f, P_f') = min_{T≥0} \sum_{i=1}^m \sum_{j=1}^n T_ij · c(f_i, f'_j) = min_{T≥0} ⟨T, C⟩ ,    (3)\n\nsubject to the constraints \sum_{j=1}^n T_ij = 1/m and \sum_{i=1}^m T_ij = 1/n, where ⟨·,·⟩ represents the Frobenius dot-product. In this work, the transport cost is defined as the cosine distance: c(f_i, f'_j) = 1 − f_i^T f'_j / (‖f_i‖_2 ‖f'_j‖_2), and C is the cost matrix such that C_ij = c(f_i, f'_j). Note that during training, we set m = n as the mini-batch size.\n\nWe propose to use the Inexact Proximal point method for Optimal Transport (IPOT) algorithm to compute the optimal transport matrix T*, which provides a solution to the original optimal transport problem (3) [59]. 
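As a small sketch of how the cost matrix C in (3) is formed from the cosine transport cost (function names are ours, and a real implementation would vectorize this over the batch):

```python
import math

def cosine_cost(f, g):
    """Transport cost c(f, g) = 1 - <f, g> / (||f||_2 ||g||_2), as in Eq. (3)."""
    dot = sum(a * b for a, b in zip(f, g))
    norm_f = math.sqrt(sum(a * a for a in f))
    norm_g = math.sqrt(sum(b * b for b in g))
    return 1.0 - dot / (norm_f * norm_g)

def cost_matrix(F, G):
    """C_ij = c(f_i, g_j) for two batches of sentence feature vectors."""
    return [[cosine_cost(f, g) for g in G] for f in F]
```

The cost is 0 for parallel features, 1 for orthogonal ones, and 2 for anti-parallel ones, so it is bounded regardless of feature magnitude.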
Specifically, IPOT iteratively solves the following optimization problem:\n\nT^(t+1) = argmin_{T∈Π(f,f')} ⟨T, C⟩ + β D_h(T, T^(t)) ,    (4)\n\nwhere D_h(T, T^(t)) = \sum_{i,j} T_ij log(T_ij / T^(t)_ij) − \sum_{i,j} T_ij + \sum_{i,j} T^(t)_ij denotes the Bregman divergence w.r.t. the entropy functional h(T) = \sum_{i,j} T_ij log T_ij. Here the Bregman divergence D_h serves as a proximity metric and β is the proximity penalty. This problem can be solved efficiently by Sinkhorn-style proximal point iterations [13, 59], as detailed in Algorithm 1. Notably, unlike the Sinkhorn algorithm [19], we do not need to back-propagate the gradient through the proximal point iterations, which is justified by the Envelope Theorem [1] (see the Supplementary Material (SM)). This accelerates the learning process significantly and improves training stability [59].\n\nAlgorithm 1 IPOT algorithm [59]\n1: Input: batch size n, {f_i}_{i=1}^n, {f'_j}_{j=1}^n, β\n2: σ = (1/n) 1_n, T^(1) = 1 1^T\n3: C_ij = c(f_i, f'_j), A_ij = e^{−C_ij / β}\n4: for t = 1, 2, 3, ... do\n5:   Q = A ⊙ T^(t) // ⊙ is the Hadamard product\n6:   for k = 1, 2, 3, ..., K do\n7:     δ = 1 / (n Q σ), σ = 1 / (n Q^T δ)\n8:   end for\n9:   T^(t+1) = diag(δ) Q diag(σ)\n10: end for\n\nAlgorithm 2 Adversarial text generation via FMD.\n1: Input: batch size n, dataset X, learning rate η, maximum number of iterations N.\n2: for itr = 1, ..., N do\n3:   for j = 1, ..., J do\n4:     Sample a mini-batch of {x_i}_1^n ~ X and {z_i}_1^n ~ N(0, I);\n5:     Extract sentence features F = {F(W_e x_i; φ)}_1^n and F' = {F(G(z_i; θ); φ)}_1^n;\n6:     Update the feature extractor F(·; φ) by maximizing: L_FM-GAN({x_i}_1^n, {z_i}_1^n; φ) = D_FMD(F, F'; φ)\n7:   end for\n8:   Repeat Steps 4 and 5;\n9:   Update the generator G(·; θ) by minimizing: L_FM-GAN({x_i}_1^n, {z_i}_1^n; θ) = D_FMD(F, F'; θ)\n10: end for\n\n3.2 Adversarial distribution matching with FMD\n\nTo integrate FMD into adversarial distribution matching, we propose to solve the following mini-max game:\n\nmin_G max_F L_FM-GAN = min_G max_F E_{x~p_x, z~p_z} [D_FMD(F(W_e x), F(G(z)))] ,    (5)\n\nwhere F(·) is the sentence feature extractor, and G(·) is the sentence generator. We call this feature mover GAN (FM-GAN). The detailed training procedure is provided in Algorithm 2.\n\nSentence generator. The Long Short-Term Memory (LSTM) recurrent neural network [25] is used as our sentence generator G(·), parameterized by θ. Let W_e ∈ R^{k×v} be our learned word embedding matrix, where v is the vocabulary size, with each word in sentence x embedded into w_t, a k-dimensional word vector. All words in the synthetic sentence are generated sequentially, i.e.,\n\nw_t = W_e argmax(a_t) , where a_t = V h_t ∈ R^v ,    (6)\n\nwhere h_t is the hidden unit updated recursively through the LSTM cell: h_t = LSTM(w_{t−1}, h_{t−1}, z); V is a decoding matrix, and softmax(V h_t) defines the distribution over the vocabulary. Note that, distinct from a traditional sentence generator, here the argmax operation is used, rather than sampling from a multinomial distribution as in the standard LSTM. Therefore, all randomness during generation is clamped into the noise vector z.\n\nThe generator G cannot be trained directly, due to the non-differentiable argmax function. Instead, a soft-argmax operator [61] is used as a continuous approximation:\n\nw̃_t = W_e softmax(ã_t) , where ã_t = V h_t / τ ∈ R^v ,    (7)\n\nwhere τ is the temperature parameter. Note that when τ → 0, this approximates (6). We denote
We denote\nG(z) = ( \u02dcw1, . . . , \u02dcwL) \u2208 Rk\u00d7L as the approximated embedding matrix for the synthetic sentence.\nFeature extractor We use the convolutional neural network proposed in [11, 29] as our sentence\nfeature extractor F (\u00b7) parameterized by \u03c6, which contains a convolution layer and a max-pooling\nlayer. Assuming a sentence of length L, the sentence is represented as a matrix W \u2208 Rk\u00d7L,\nwhere k is the word-embedding dimension, and L is the maximum sentence length. A convolution\n\ufb01lter Wconv \u2208 Rk\u00d7l is applied to a window of l words to produce new features. After applying\nthe nonlinear activation function, we then use the max-over-time pooling operation [11] to the\nfeature maps and extract the maximum values. While the convolution operator can extract features\nindependent of their positions in the sentence, the max-pooling operator tries to capture the most\nsalient features.\nThe above procedure describes how to extract features using one \ufb01lter. Our model uses multiple\n\ufb01lters with different window sizes, where each \ufb01lter is considered as a linguistic feature detector.\nAssume d1 different window sizes, and for each window size we have d2 \ufb01lters; then a sentence\nfeature vector can be represent as f = F (W) \u2208 Rd, where d = d1 \u00d7 d2.\n3.3 Extensions to conditional text generation tasks\n\nStyle transfer\nOur FM-GAN model can be readily generalized to conditional generation tasks,\nsuch as text style transfer [26, 35, 44, 54]. The style transfer task is essentially learning the con-\nditional distribution p(x2|x1; c1, c2) and p(x1|x2; c1, c2), where c1 and c2 represent the labels for\ndifferent styles, with x1 and x2 sentences in different styles. 
Assuming x1 and x2 are conditionally\nindependent given the latent code z, we have:\n\np(x1|x2; c1, c2) =(cid:90)z\n\np(x1|z, c1) \u00b7 p(z|x2, c2)dz = Ez\u223cp(z|x2,c2)[p(x1|z, c1)] .\n\n(8)\n\nEquation (8) suggests an autoencoder can be applied for this task. From this perspective, we can\napply our optimal transport method in the cross-aligned autoencoder [54], by replacing the standard\nGAN loss with our FMD critic. We follow the same idea as [54] to build the style transfer framework.\nE : X \u00d7 C \u2192 Z is our encoder that infers the content z from given style c and sentence x;\nG : Z \u00d7 C \u2192 X is our decoder that generates synthetic sentence \u02c6x, given content z and style c. We\nadd the following reconstruction loss for the autoencoder:\n(9)\n\nLrec = Ex1\u223cpx1 [\u2212 log pG(x1|c1, E(x1, c1))] + Ex2\u223cpx2 [\u2212 log pG(x2|c2, E(x2, c2))] ,\n\n5\n\n\fwhere px1 and px2 are the empirical data distribution for each style. We also need to implement\nadversarial training on the generator G with discrete data. First, we use the soft-argmax approximation\ndiscussed in Section 3.2; second, we also use Professor-Forcing [32] algorithm to match the sequence\nof LSTM hidden states. That is, the discriminator is designed to discriminate \u02c6x2 = G(E(x1, c1), c2)\nwith real sentence x2. Unlike [54] which uses two discriminators, our model only needs to apply the\nFMD critic twice to match the distributions for two different styles:\n\nLadv = Ex1\u223cpx1 ,x2\u223cpx2 [DFMD(F (G(E(x1, c1), c2)), F (Wex2))\n\n(10)\n\n+ DFMD(F (G(E(x2, c2), c1)), F (Wex1))] ,\n\nwhere We is the learned word embedding matrix. The \ufb01nal objective function for this task is:\nminG,E maxF Lrec + \u03bb \u00b7 Ladv, where \u03bb is a hyperparameter that balances these two terms.\nUnsupervised decipher\nOur model can also be used to tackle the task of unsupervised cipher\ncracking by using the framework of CycleGAN [62]. 
In this task, we have two different corpora, i.e., X_1 denotes the original sentences, and X_2 denotes the corpus encrypted using some cipher code, which is unknown to our model. Our goal is to design two generators that can map one corpus to the other, i.e., G_1 : X_1 → X_2 and G_2 : X_2 → X_1. Unlike the style-transfer task, we define F_1 and F_2 as two sentence feature extractors for the different corpora. Here we denote p_{x_1} to be the empirical distribution of the original corpus, and p_{x_2} to be the distribution of the encrypted corpus. Following [20], we design two losses: the cycle-consistency loss (reconstruction loss) and the adversarial feature matching loss. The cycle-consistency loss is defined on the feature space as:\n\nL_cyc = E_{x_1~p_{x_1}}[‖F_1(G_2(G_1(x_1))) − F_1(W_e x_1)‖_1] + E_{x_2~p_{x_2}}[‖F_2(G_1(G_2(x_2))) − F_2(W_e x_2)‖_1] ,    (11)\n\nwhere ‖·‖_1 denotes the ℓ_1-norm, and W_e is the word embedding matrix. The adversarial loss aims to help match the generated samples with the target:\n\nL_adv = E_{x_1~p_{x_1}, x_2~p_{x_2}}[D_FMD(F_1(G_2(x_2)), F_1(W_e x_1)) + D_FMD(F_2(G_1(x_1)), F_2(W_e x_2))] .    (12)\n\nThe final objective function for the decipher task is: min_{G_1,G_2} max_{F_1,F_2} L_cyc + λ · L_adv, where λ is a hyperparameter that balances the two terms.\n\n4 Related work\n\nGAN for text generation. SeqGAN [60], MaliGAN [8], RankGAN [37], and MaskGAN [15] use reinforcement learning (RL) algorithms for text generation. The idea behind all these works is similar: they use the REINFORCE algorithm to get an unbiased gradient estimator for the generator, and apply the roll-out policy to obtain the reward from the discriminator. LeakGAN [24] adopts a hierarchical RL framework to improve text generation. However, it is slow to train due to its complex design. 
For GANs in the RL-free category, GSGAN [31] and TextGAN [61] use the Gumbel-softmax and soft-argmax trick, respectively, to deal with discrete data. While the latter uses MMD to match the features of real and synthetic sentences, both models still keep the original GAN loss function, which may result in the gradient-vanishing issue of the discriminator.\n\nGAN with OT. Wasserstein GAN (WGAN) [3, 23] applies the EMD by imposing the 1-Lipschitz constraint on the discriminator, which alleviates the gradient-vanishing issue when dealing with continuous data (i.e., images). However, for discrete data (i.e., text), the gradient still vanishes after a few iterations, even when weight-clipping or the gradient-penalty is applied on the discriminator [20]. Instead, the Sinkhorn divergence generative model (Sinkhorn-GM) [19] and Optimal transport GAN (OT-GAN) [50] optimize the Sinkhorn divergence [13], defined as an entropy-regularized EMD (2): W_ε(f, f') = min_{T∈Π(f,f')} ⟨T, C⟩ + ε · h(T), where h(T) = \sum_{i,j} T_ij log T_ij is the entropy term, and ε is the hyperparameter. While the Sinkhorn algorithm [13] is proposed to solve this entropy-regularized EMD, the solution is sensitive to the value of the hyperparameter ε, leading to a trade-off between computational efficiency and training stability. Distinct from that, our method uses IPOT to tackle the original problem of OT. 
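To make this contrast concrete, the IPOT updates of Algorithm 1 can be sketched in a few lines of unvectorized code. This is a minimal sketch under stated assumptions (uniform marginals, m = n, function name ours); practical implementations vectorize the inner loops:

```python
import math

def ipot(C, beta=1.0, n_outer=50, n_inner=1):
    """Proximal-point (IPOT) iterations for an OT plan with uniform
    marginals 1/n, following the Sinkhorn-style updates of Algorithm 1.
    C is an n x n cost matrix; returns the transport matrix T."""
    n = len(C)
    # A_ij = exp(-C_ij / beta); beta mainly affects the convergence rate
    A = [[math.exp(-C[i][j] / beta) for j in range(n)] for i in range(n)]
    sigma = [1.0 / n] * n
    T = [[1.0] * n for _ in range(n)]
    for _ in range(n_outer):
        Q = [[A[i][j] * T[i][j] for j in range(n)] for i in range(n)]  # Hadamard product
        for _ in range(n_inner):
            # Sinkhorn-style scaling steps toward the uniform marginals
            delta = [1.0 / (n * sum(Q[i][j] * sigma[j] for j in range(n))) for i in range(n)]
            sigma = [1.0 / (n * sum(Q[i][j] * delta[i] for i in range(n))) for j in range(n)]
        T = [[delta[i] * Q[i][j] * sigma[j] for j in range(n)] for i in range(n)]
    return T
```

On a cost matrix that favors the diagonal, the returned plan concentrates its mass there while each column sums to 1/n, illustrating that the proximal iterations approach the unregularized OT solution rather than an entropy-blurred one.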
In practice, IPOT is more efficient than the Sinkhorn algorithm, and the hyperparameter β in (4) only affects the convergence rate [59].\n\n5 Experiment\n\nWe apply the proposed model to three application scenarios: generic (unconditional) sentence generation, conditional sentence generation (with pre-specified sentiment), and unsupervised decipher.\n\nTable 1: Summary statistics for the datasets used in the generic text generation experiments.\nDataset            | Train   | Test   | Vocabulary | Average length\nCUB captions       | 100,000 | 10,000 | 4,391      | 15\nMS COCO captions   | 120,000 | 10,000 | 27,842     | 11\nEMNLP2017 WMT News | 278,686 | 10,000 | 5,728      | 28\n\nFigure 2: Test-BLEU score (higher value implies better quality) vs self-BLEU score (lower value implies better diversity). Upper panel is BLEU-3 and lower panel is BLEU-4.\n\nFor the generic sentence generation task, we experiment with three standard benchmarks: CUB captions [57], MS COCO captions [38], and EMNLP2017 WMT News [24]. Since the sentences in the CUB dataset are typically short and have similar structure, it is employed as our toy evaluation. For the second dataset, we sample 130,000 sentences from the original MS COCO captions. Note that we do not remove any low-frequency words for the first two datasets, in order to evaluate the models in the case with a relatively large vocabulary size. The third dataset is a large long-text collection from the EMNLP2017 WMT News dataset. To facilitate comparison with baseline methods, we follow the same data preprocessing procedures as in [24]. The summary statistics of all the datasets are presented in Table 1.\n\nFor conditional text generation, we consider the task of transferring an original sentence to the opposite sentiment, in the case where parallel (paired) data are not available. We use the same data as introduced in [54]. 
For the unsupervised decipher task, we follow the experimental setup in CipherGAN [20] and evaluate the model improvement after replacing the critic with the proposed FMD objective.\n\nWe employ the test-BLEU score [60], self-BLEU score [63], and human evaluation as the evaluation metrics for the generic sentence generation task. To ensure fair comparison, we perform extensive comparisons with several strong baseline models using the benchmark tool in Texygen [63]. For the non-parallel text style transfer experiment, following [26, 54], we use a pretrained classifier to calculate the sentiment accuracy of transferred sentences. We also leverage human evaluation to further measure the quality of the transfer results. For the deciphering experiment, we adopt the average proportion of correctly mapped words as the accuracy, as proposed in [20]. Our code will be released to encourage future research.\n\n5.1 Generic text generation\n\nIn general, when evaluating the performance of different models, we desire a high test-BLEU score (good quality) and a low self-BLEU score (high diversity). Both scores should be considered: (i) a high test-BLEU score together with a high self-BLEU score means that the model might generate good sentences while suffering from mode collapse (i.e., low diversity); (ii) if a model generates sentences randomly, the diversity of the generated sentences could be high but the test-BLEU score would be low. Figure 2 compares the performance of each model. For each subplot, the x-axis represents test-BLEU, and the y-axis represents self-BLEU (here we only show the BLEU-3 and BLEU-4 figures; more quantitative results can be found in the SM). 
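As a rough sketch of how these two metrics behave, a simplified BLEU based on clipped n-gram precisions, and a self-BLEU built on top of it, can be written as follows (this omits the brevity penalty and smoothing of the full BLEU metric [43], and the function names are ours):

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_precision(candidate, references, max_n=3):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (no brevity penalty or smoothing, unlike the full metric)."""
    score = 1.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        if not cand:
            return 0.0
        clipped = sum(
            min(count, max(ngrams(ref, n)[gram] for ref in references))
            for gram, count in cand.items()
        )
        p = clipped / sum(cand.values())
        if p == 0.0:
            return 0.0
        score *= p ** (1.0 / max_n)
    return score

def self_bleu(sentences, max_n=3):
    """Average BLEU of each sentence against all the others:
    lower self-BLEU indicates a more diverse set of generations."""
    return sum(
        bleu_precision(s, sentences[:i] + sentences[i + 1:], max_n)
        for i, s in enumerate(sentences)
    ) / len(sentences)
```

A set of identical sentences gets self-BLEU 1.0 (total mode collapse), while sentences sharing no n-grams get 0.0, which is why the two axes of Figure 2 must be read together.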
For the CUB and MS COCO datasets, our model achieves both high test-BLEU and low self-BLEU, providing realistic sentences with high diversity. For the EMNLP WMT dataset, the synthetic sentences from SeqGAN, RankGAN, GSGAN and TextGAN are less coherent and realistic (examples can be found in the SM) due to the long-text nature of the dataset. In comparison, our model is still capable of providing realistic results.\n\nTo further evaluate the generation quality on the EMNLP WMT dataset, we conduct a human Turing test on Amazon Mechanical Turk; 10 judges are asked to rate over 100 randomly sampled sentences from each model on a scale from 0 to 5. The means and standard deviations of the rating scores are calculated and provided in Table 2. We also provide some examples of the generated sentences from LeakGAN and our model in Table 3. More generated sentences are provided in the SM.\n\nTable 2: Human evaluation results on EMNLP WMT.\nMethod      | MLE         | SeqGAN      | RankGAN     | GSGAN\nHuman score | 2.54 ± 0.79 | 2.55 ± 0.83 | 2.86 ± 0.95 | 2.52 ± 0.78\nMethod      | TextGAN     | LeakGAN     | Our model   | real sentences\nHuman score | 3.03 ± 0.92 | 3.41 ± 0.82 | 3.72 ± 0.80 | 4.21 ± 0.77\n\nLeakGAN:\n(1) " people , if aleppo recognised switzerland stability , " mr . trump has said that " " it has been filled before the courts .\n(2) the russian military , meanwhile previously infected orders , but it has already been done on the lead of the attack .\nOurs:\n(1) this is why we will see the next few years , we ' re looking forward to the top of the world , which is how we ' re in the future .\n(2) If you ' re talking about the information about the public , which is not available , they have to see a new study .\nTable 3: Examples of generated sentences from LeakGAN and our model.\n\n5.2 Non-parallel text style transfer\n\nTable 4 presents the sentiment transfer results on the Yelp review dataset, which is evaluated with the accuracy of transferred sentences, determined by a pretrained CNN classifier [29]. Note that with the same experimental setup as in [54], our model achieves significantly higher transfer accuracy compared with the cross-aligned autoencoder (CAE) model [54]. Moreover, our model even outperforms the controllable text generation method [26] and BST [44], where a sentiment classifier is directly pre-trained to guide the sentence generation process (on the contrary, our model is trained in an end-to-end manner and requires no pre-training steps), and thus should potentially have better control over the style (i.e., sentiment) of generated sentences [54]. The superior performance of the proposed method highlights the ability of FMD to mitigate the vanishing-gradient issue caused by the discrete nature of text samples, and gives rise to better matching between the distributions of reviews belonging to two different sentiments.\n\nHuman evaluations are conducted to assess the quality of the transferred sentences. In this regard, we randomly sample 100 sentences from the test set, and 5 volunteers rate the outputs of different models in terms of their fluency, sentiment, and content preservation in a double-blind fashion. The rating score is from 0 to 5. 
Detailed results are shown in Table 4. We also provide sentiment-transfer examples in Table 5; more examples are provided in the SM.

Table 4: Sentiment transfer accuracy and human evaluation results on Yelp.

Method        Controllable [26]   CAE [54]   BST [44]   Our model
Accuracy (%)  87.2                80.6       84.5       89.8
Sentiment     -                   3.2        3.6        4.1
Content       -                   4.1        4.6        4.5
Fluency       -                   3.7        4.2        4.4

Original:      one of the best gourmet store shopping experiences i have ever had .
Controllable:  one of the best gourmet store shopping experiences i have ever had .
CAE:           one of the worst staff i would ever ever ever had ever had .
Ours:          one of the worst indian shopping store experiences i have ever had .

Original:      staff behind the deli counter were super nice and efficient !
Controllable:  staff behind the deli counter were super rude and efficient !
CAE:           the staff were the front desk and were extremely rude airport !
Ours:          staff behind the deli counter were super nice and inefficient !

Table 5: Sentiment transfer examples.

5.3 Unsupervised decipher

CipherGAN [20] uses GANs to tackle the task of unsupervised cipher cracking, building on the CycleGAN framework [62] and adopting techniques such as the Gumbel-softmax [31] to handle discrete data. Unsupervised deciphering can be understood as a form of unsupervised machine translation, in which one language is treated as an enciphering of the other. In this experiment, we adapt the idea of the feature-mover's distance to the original CipherGAN framework and test the modified model on the Brown English text dataset [16].
The Brown English-language corpus [30] contains over one million words. In this experiment, only the top 200 most frequent words are retained, while all others are replaced by an "unknown" token. We denote this modified word-level dataset Brown-W200.
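The Brown-W200 preprocessing described above (keep the top 200 most frequent words, map everything else to an "unknown" token) can be sketched in a few lines; the function name and token string below are illustrative, not taken from the CipherGAN codebase:

```python
from collections import Counter

def build_brown_w200(sentences, vocab_size=200, unk="<unk>"):
    """Replace every word outside the `vocab_size` most frequent
    words of the corpus with the unknown token."""
    counts = Counter(w for sent in sentences for w in sent)
    keep = {w for w, _ in counts.most_common(vocab_size)}
    return [[w if w in keep else unk for w in sent] for sent in sentences]
```

Applied to the full word-tokenized Brown corpus, this yields the word-level Brown-W200 data used in the experiment.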
We use the Vigenère cipher [7] to encipher the original plain text. This dataset can be downloaded from this repository¹.
For a fair comparison, all model architectures and parameters are kept the same as in CipherGAN, while the critic of the discriminator is replaced by the FMD objective shown in (3). Table 6 reports the quantitative results in terms of the average proportion of words correctly mapped in a given sequence (i.e., deciphering accuracy). The baseline frequency-analysis model operates only when the cipher key is known. Our model achieves higher accuracy than the original CipherGAN. Some other experimental setups from [20] are not evaluated, because their accuracy is already extremely high (above 99%) and any improvement would not be apparent.

Method        Freq. Analysis (with keys)   CipherGAN [20]   Our model
Accuracy (%)  < 0.1 (44.3)                 75.7             77.2

Table 6: Decipher results on Brown-W200.

6 Conclusion

We introduce a novel approach for text generation based on the feature-mover's distance (FMD), called feature-mover GAN (FM-GAN). Applying the model to several tasks, we demonstrate that it delivers strong performance compared to existing text-generation approaches. For future work, FM-GAN can potentially be applied to other tasks, such as image captioning [56], joint distribution matching [9, 17, 34, 45, 46, 55], unsupervised sequence classification [39], and unsupervised machine translation [4, 12, 33].

Acknowledgments

This research was supported in part by DARPA, DOE, NIH, ONR and NSF.

References

[1] S. Afriat. Theory of maxima and the method of Lagrange. SIAM Journal on Applied Mathematics, 1971.
[2] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In ICLR, 2017.
[3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
[4] M. Artetxe, G. Labaka, E. Agirre, and K. Cho.
Unsupervised neural machine translation. In ICLR, 2018.
[5] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
[6] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS, 2015.

¹ https://github.com/for-ai/CipherGAN

[7] A. A. Bruen and M. A. Forcinito. Cryptography, Information Theory, and Error-Correction: A Handbook for the 21st Century, volume 68. John Wiley & Sons, 2011.
[8] T. Che, Y. Li, R. Zhang, R. D. Hjelm, W. Li, Y. Song, and Y. Bengio. Maximum-likelihood augmented discrete generative adversarial networks. arXiv:1702.07983, 2017.
[9] L. Chen, S. Dai, Y. Pu, E. Zhou, C. Li, Q. Su, C. Chen, and L. Carin. Symmetric variational autoencoder and connections to adversarial learning. In AISTATS, 2018.
[10] K. Cho, B. Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
[11] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. JMLR, 2011.
[12] A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou. Word translation without parallel data. arXiv:1710.04087, 2017.
[13] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, 2013.
[14] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollár, and J. Gao. From captions to visual concepts and back. In CVPR, 2015.
[15] W. Fedus, I. Goodfellow, and A. M. Dai. MaskGAN: Better text generation via filling in the _. In ICLR, 2018.
[16] W. N. Francis. Brown corpus manual. http://icame.uib.no/brown/bcm.html, 1979.
[17] Z. Gan, L. Chen, W. Wang, Y. Pu, Y. Zhang, H. Liu, C. Li, and L. Carin. Triangle generative adversarial networks. In NIPS, 2017.
[18] Z. Gan, Y. Pu, R. Henao, C. Li, X. He, and L. Carin. Learning generic sentence representations using convolutional neural networks. In EMNLP, 2017.
[19] A. Genevay, G. Peyré, and M. Cuturi. Learning generative models with Sinkhorn divergences. In AISTATS, 2018.
[20] A. N. Gomez, S. Huang, I. Zhang, B. M. Li, M. Osama, and L. Kaiser. Unsupervised cipher cracking using discrete GANs. arXiv:1801.04883, 2018.
[21] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[22] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. JMLR, 2012.
[23] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In NIPS, 2017.
[24] J. Guo, S. Lu, H. Cai, W. Zhang, Y. Yu, and J. Wang. Long text generation via adversarial training with leaked information. In AAAI, 2018.
[25] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
[26] Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing. Toward controlled generation of text. In ICML, 2017.
[27] F. Huszár. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv:1511.05101, 2015.
[28] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-softmax. In ICLR, 2017.
[29] Y. Kim. Convolutional neural networks for sentence classification. In EMNLP, 2014.
[30] H. Kucera and W. Francis. A standard corpus of present-day edited American English, for use with digital computers (revised and amplified from 1967 version), 1979.
[31] M. J. Kusner and J. M. Hernández-Lobato. GANs for sequences of discrete elements with the Gumbel-softmax distribution. arXiv:1611.04051, 2016.
[32] A. Lamb, V. Dumoulin, and A. Courville. Discriminative regularization for generative models. arXiv:1602.03220, 2016.
[33] G. Lample, M. Ott, A. Conneau, L. Denoyer, and M. Ranzato. Phrase-based & neural unsupervised machine translation. arXiv:1804.07755, 2018.
[34] C. Li, H. Liu, C. Chen, Y. Pu, L. Chen, R. Henao, and L. Carin. ALICE: Towards understanding adversarial learning for joint distribution matching. In NIPS, 2017.
[35] J. Li, R. Jia, H. He, and P. Liang. Delete, retrieve, generate: A simple approach to sentiment and style transfer. In NAACL, 2018.
[36] J. Li, W. Monroe, T. Shi, A. Ritter, and D. Jurafsky. Adversarial learning for neural dialogue generation. In EMNLP, 2017.
[37] K. Lin, D. Li, X. He, Z. Zhang, and M.-T. Sun. Adversarial ranking for language generation. In NIPS, 2017.
[38] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[39] Y. Liu, J. Chen, and L. Deng. An unsupervised learning method exploiting sequential output statistics. arXiv:1702.07817, 2017.
[40] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR, 2017.
[41] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In ISCA, 2010.
[42] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS, 2016.
[43] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.
[44] S. Prabhumoye, Y. Tsvetkov, R. Salakhutdinov, and A. W. Black. Style transfer through back-translation. In ACL, 2018.
[45] Y. Pu, S. Dai, Z. Gan, W. Wang, G. Wang, Y. Zhang, R. Henao, and L. Carin. JointGAN: Multi-domain joint distribution learning with generative adversarial nets. In ICML, 2018.
[46] Y. Pu, W. Wang, R. Henao, L. Chen, Z. Gan, C. Li, and L. Carin. Adversarial symmetric variational autoencoder. In NIPS, 2017.
[47] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. In ICLR, 2016.
[48] Y. Rubner, C. Tomasi, and L. J. Guibas. A metric for distributions with applications to image databases. In ICCV, 1998.
[49] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
[50] T. Salimans, H. Zhang, A. Radford, and D. Metaxas. Improving GANs using optimal transport. In ICLR, 2018.
[51] D. Shen, G. Wang, W. Wang, M. R. Min, Q. Su, Y. Zhang, C. Li, R. Henao, and L. Carin. Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms. In ACL, 2018.
[52] D. Shen, Y. Zhang, R. Henao, Q. Su, and L. Carin. Deconvolutional latent-variable model for text sequence matching. In AAAI, 2018.
[53] S. Shen, Y. Cheng, Z. He, W. He, H. Wu, M. Sun, and Y. Liu. Minimum risk training for neural machine translation. In ACL, 2015.
[54] T. Shen, T. Lei, R. Barzilay, and T. Jaakkola. Style transfer from non-parallel text by cross-alignment. In NIPS, 2017.
[55] C. Tao, L. Chen, R. Henao, J. Feng, and L. Carin. Chi-square generative adversarial network. In ICML, 2018.
[56] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
[57] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[58] S. Wiseman and A. M. Rush. Sequence-to-sequence learning as beam-search optimization. In EMNLP, 2016.
[59] Y. Xie, X. Wang, R. Wang, and H. Zha. A fast proximal point method for Wasserstein distance. arXiv:1802.04307, 2018.
[60] L. Yu, W. Zhang, J. Wang, and Y. Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.
[61] Y. Zhang, Z. Gan, K. Fan, Z. Chen, R. Henao, D. Shen, and L. Carin. Adversarial feature matching for text generation. In ICML, 2017.
[62] J. Zhu, T. Park, P. Isola, and A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
[63] Y. Zhu, S. Lu, L. Zheng, J. Guo, W. Zhang, J. Wang, and Y. Yu. Texygen: A benchmarking platform for text generation models. In SIGIR, 2018.