{"title": "Dual Learning for Machine Translation", "book": "Advances in Neural Information Processing Systems", "page_first": 820, "page_last": 828, "abstract": "While neural machine translation (NMT) is making good progress in the past two years, tens of millions of bilingual sentence pairs are needed for its training. However, human labeling is very costly. To tackle this training data bottleneck, we develop a dual-learning mechanism, which can enable an NMT system to automatically learn from unlabeled data through a dual-learning game. This mechanism is inspired by the following observation: any machine translation task has a dual task, e.g., English-to-French translation (primal) versus French-to-English translation (dual); the primal and dual tasks can form a closed loop, and generate informative feedback signals to train the translation models, even if without the involvement of a human labeler. In the dual-learning mechanism, we use one agent to represent the model for the primal task and the other agent to represent the model for the dual task, then ask them to teach each other through a reinforcement learning process. Based on the feedback signals generated during this process (e.g., the language-model likelihood of the output of a model, and the reconstruction error of the original sentence after the primal and dual translations), we can iteratively update the two models until convergence (e.g., using the policy gradient methods). We call the corresponding approach to neural machine translation \\emph{dual-NMT}. 
Experiments show that dual-NMT works very well on English$\\leftrightarrow$French translation; especially, by learning from monolingual data (with 10\\% bilingual data for warm start), it achieves a comparable accuracy to NMT trained from the full bilingual data for the French-to-English translation task.", "full_text": "Dual Learning for Machine Translation\n\nDi He1,\u2217, Yingce Xia2,\u2217, Tao Qin3, Liwei Wang1, Nenghai Yu2, Tie-Yan Liu3, Wei-Ying Ma3\n\n1Key Laboratory of Machine Perception (MOE), School of EECS, Peking University\n\n2University of Science and Technology of China\n\n3Microsoft Research\n\n1{dih,wanglw}@cis.pku.edu.cn; 2xiayingc@mail.ustc.edu.cn; 2ynh@ustc.edu.cn\n\n3{taoqin,tie-yan.liu,wyma}@microsoft.com\n\nAbstract\n\nWhile neural machine translation (NMT) is making good progress in the past\ntwo years, tens of millions of bilingual sentence pairs are needed for its training.\nHowever, human labeling is very costly. To tackle this training data bottleneck, we\ndevelop a dual-learning mechanism, which can enable an NMT system to automat-\nically learn from unlabeled data through a dual-learning game. This mechanism is\ninspired by the following observation: any machine translation task has a dual task,\ne.g., English-to-French translation (primal) versus French-to-English translation\n(dual); the primal and dual tasks can form a closed loop, and generate informative\nfeedback signals to train the translation models, even if without the involvement of\na human labeler. 
In the dual-learning mechanism, we use one agent to represent the\nmodel for the primal task and the other agent to represent the model for the dual\ntask, then ask them to teach each other through a reinforcement learning process.\nBased on the feedback signals generated during this process (e.g., the language-\nmodel likelihood of the output of a model, and the reconstruction error of the\noriginal sentence after the primal and dual translations), we can iteratively update\nthe two models until convergence (e.g., using the policy gradient methods). We call\nthe corresponding approach to neural machine translation dual-NMT. Experiments\nshow that dual-NMT works very well on English\u2194French translation; especially,\nby learning from monolingual data (with 10% bilingual data for warm start), it\nachieves a comparable accuracy to NMT trained from the full bilingual data for the\nFrench-to-English translation task.\n\n1\n\nIntroduction\n\nState-of-the-art machine translation (MT) systems, including both the phrase-based statistical transla-\ntion approaches [6, 3, 12] and the recently emerged neural networks based translation approaches\n[1, 5], heavily rely on aligned parallel training corpora. However, such parallel data are costly to\ncollect in practice and thus are usually limited in scale, which may constrain the related research and\napplications.\nGiven that there exist almost unlimited monolingual data in the Web, it is very natural to leverage\nthem to boost the performance of MT systems. Actually different methods have been proposed for this\npurpose, which can be roughly classi\ufb01ed into two categories. In the \ufb01rst category [2, 4], monolingual\ncorpora in the target language are used to train a language model, which is then integrated with the\nMT models trained from parallel bilingual corpora to improve the translation quality. 
In the second\ncategory [14, 11], pseudo bilingual sentence pairs are generated from monolingual data by using the\n\u2217The \ufb01rst two authors contributed equally to this work. This work was conducted when the second author\n\nwas visiting Microsoft Research Asia.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\ftranslation model trained from aligned parallel corpora, and then these pseudo bilingual sentence\npairs are used to enlarge the training data for subsequent learning. While the above methods could\nimprove the MT performance to some extent, they still suffer from certain limitations. The methods\nin the \ufb01rst category only use the monolingual data to train language models, but do not fundamentally\naddress the shortage of parallel training data. Although the methods in the second category can\nenlarge the parallel training data, there is no guarantee/control on the quality of the pseudo bilingual\nsentence pairs.\nIn this paper, we propose a dual-learning mechanism that can leverage monolingual data (in both\nthe source and target languages) in a more effective way. By using our proposed mechanism, these\nmonolingual data can play a similar role to the parallel bilingual data, and signi\ufb01cantly reduce the\nrequirement on parallel bilingual data during the training process. Speci\ufb01cally, the dual-learning\nmechanism for MT can be described as the following two-agent communication game.\n\n1. The \ufb01rst agent, who only understands language A, sends a message in language A to the\nsecond agent through a noisy channel, which converts the message from language A to\nlanguage B using a translation model.\n\n2. The second agent, who only understands language B, receives the translated message in\nlanguage B. 
She checks the message and noti\ufb01es the \ufb01rst agent whether it is a natural\nsentence in language B (note that the second agent may not be able to verify the correctness\nof the translation since the original message is invisible to her). Then she sends the received\nmessage back to the \ufb01rst agent through another noisy channel, which converts the received\nmessage from language B back to language A using another translation model.\n\n3. After receiving the message from the second agent, the \ufb01rst agent checks it and noti\ufb01es\nthe second agent whether the message she receives is consistent with her original message.\nThrough the feedback, both agents will know whether the two communication channels (and\nthus the two translation models) perform well and can improve them accordingly.\n\n4. The game can also be started from the second agent with an original message in language B,\nand then the two agents will go through a symmetric process and improve the two channels\n(translation models) according to the feedback.\n\nIt is easy to see from the above descriptions, although the two agents may not have aligned bilingual\ncorpora, they can still get feedback about the quality of the two translation models and collectively\nimprove the models based on the feedback. This game can be played for an arbitrary number of\nrounds, and the two translation models will get improved through this reinforcement procedure (e.g.,\nby means of the policy gradient methods). In this way, we develop a general learning framework for\ntraining machine translation models through a dual-learning game.\nThe dual learning mechanism has several distinguishing features. First, we train translation models\nfrom unlabeled data through reinforcement learning. Our work signi\ufb01cantly reduces the requirement\non the aligned bilingual data, and it opens a new window to learn to translate from scratch (i.e., even\nwithout using any parallel data). 
Experimental results show that our method is very promising.\nSecond, we demonstrate the power of deep reinforcement learning (DRL) for complex real-world\napplications, rather than just games. Deep reinforcement learning has drawn great attention in recent\nyears. However, most of them today focus on video or board games, and it remains a challenge to\nenable DRL for more complicated applications whose rules are not pre-de\ufb01ned and where there is\nno explicit reward signals. Dual learning provides a promising way to extract reward signals for\nreinforcement learning in real-world applications like machine translation.\nThe remaining parts of the paper are organized as follows. In Section 2, we brie\ufb02y review the\nliterature of neural machine translation. After that, we introduce our dual-learning algorithm for\nneural machine translation. The experimental results are provided and discussed in Section 4. We\nextend the breadth and depth of dual learning in Section 5 and discuss future work in the last section.\n\n2 Background: Neural Machine Translation\n\nIn principle, our dual-learning framework can be applied to both phrase-based statistical machine\ntranslation and neural machine translation. In this paper, we focus on the latter one, i.e., neural\n\n2\n\n\fmachine translation (NMT), due to its simplicity as an end-to-end system, without suffering from\nhuman crafted engineering [5].\nNeural machine translation systems are typically implemented with a Recurrent Neural Network (RN-\nN) based encoder-decoder framework. 
Such a framework learns a probabilistic mapping $P(y|x)$ from\na source language sentence $x = \\{x_1, x_2, \\ldots, x_{T_x}\\}$ to a target language sentence $y = \\{y_1, y_2, \\ldots, y_{T_y}\\}$,\nin which $x_i$ and $y_t$ are the $i$-th and $t$-th words of sentences $x$ and $y$ respectively.\nTo be more concrete, the encoder of NMT reads the source sentence $x$ and generates $T_x$ hidden states\nby an RNN:\n\n$$h_i = f(h_{i-1}, x_i) \\qquad (1)$$\n\nin which $h_i$ is the hidden state at time $i$, and function $f$ is the recurrent unit such as Long Short-Term\nMemory (LSTM) unit [12] or Gated Recurrent Unit (GRU) [3]. Afterwards, the decoder of NMT\ncomputes the conditional probability of each target word $y_t$ given its preceding words $y_{<t}$, as well\nas the source sentence, i.e., $P(y_t|y_{<t}, x)$, which is then used to specify $P(y|x)$ according to the\nprobability chain rule. $P(y_t|y_{<t}, x)$ is given as:\n\n$$P(y_t|y_{<t}, x) \\propto \\exp(y_t; r_t, c_t) \\qquad (2)$$\n$$r_t = g(r_{t-1}, y_{t-1}, c_t) \\qquad (3)$$\n$$c_t = q(r_{t-1}, h_1, \\cdots, h_{T_x}) \\qquad (4)$$\n\nwhere $r_t$ is the decoder RNN hidden state at time $t$, similarly computed by an LSTM or GRU, and $c_t$\ndenotes the contextual information in generating word $y_t$ according to different encoder hidden states.\n$c_t$ can be a \u2018global\u2019 signal summarizing sentence $x$ [3, 12], e.g., $c_1 = \\cdots = c_{T_y} = h_{T_x}$, or a \u2018local\u2019\nsignal implemented by an attention mechanism [1], e.g., $c_t = \\sum_{i=1}^{T_x} \\alpha_i h_i$, $\\alpha_i = \\frac{\\exp\\{a(h_i, r_{t-1})\\}}{\\sum_j \\exp\\{a(h_j, r_{t-1})\\}}$,\nwhere $a(\\cdot,\\cdot)$ is a feed-forward neural network.\nWe denote all the parameters to be optimized in the neural network as $\\Theta$ and denote $D$ as the dataset\nthat contains source-target sentence pairs for training. 
Then the learning objective is to seek the\noptimal parameters $\\Theta^*$:\n\n$$\\Theta^* = \\arg\\max_{\\Theta} \\sum_{(x,y) \\in D} \\sum_{t=1}^{T_y} \\log P(y_t|y_{<t}, x; \\Theta) \\qquad (5)$$\n\n3 Dual Learning for Neural Machine Translation\n\nIn this section, we present the dual-learning mechanism for neural machine translation. Noticing\nthat MT can (always) happen in dual directions, we first design a two-agent game with a forward\ntranslation step and a backward translation step, which can provide quality feedback to the dual\ntranslation models even using monolingual data only. Then we propose a dual-learning algorithm,\ncalled dual-NMT, to improve the two translation models based on the quality feedback provided in\nthe game.\nConsider two monolingual corpora $D_A$ and $D_B$ which contain sentences from language A and B\nrespectively. Please note these two corpora are not necessarily aligned with each other, and they may\neven have no topical relationship with each other at all. Suppose we have two (weak) translation\nmodels that can translate sentences from A to B and vice versa. Our goal is to improve the accuracy\nof the two models by using monolingual corpora instead of parallel corpora. Our basic idea is to\nleverage the duality of the two translation models. Starting from a sentence in any monolingual data,\nwe first translate it forward to the other language and then further translate backward to the original\nlanguage. By evaluating these two-hop translation results, we will get a sense about the quality of the\ntwo translation models, and be able to improve them accordingly. This process can be iterated for\nmany rounds until both translation models converge.\nSuppose corpus $D_A$ contains $N_A$ sentences, and $D_B$ contains $N_B$ sentences. 
Denote $P(.|s; \\Theta_{AB})$\nand $P(.|s; \\Theta_{BA})$ as two neural translation models, where $\\Theta_{AB}$ and $\\Theta_{BA}$ are their parameters (as\ndescribed in Section 2).\n\nAlgorithm 1 The dual-learning algorithm\n1: Input: Monolingual corpora $D_A$ and $D_B$, initial translation models $\\Theta_{AB}$ and $\\Theta_{BA}$, language models $LM_A$ and $LM_B$, $\\alpha$, beam search size $K$, learning rates $\\gamma_{1,t}$, $\\gamma_{2,t}$.\n2: repeat\n3:   $t = t + 1$.\n4:   Sample sentence $s_A$ and $s_B$ from $D_A$ and $D_B$ respectively.\n5:   Set $s = s_A$.   \u25b7 Model update for the game beginning from A.\n6:   Generate $K$ sentences $s_{mid,1}, \\ldots, s_{mid,K}$ using beam search according to translation model $P(.|s; \\Theta_{AB})$.\n7:   for $k = 1, \\ldots, K$ do\n8:     Set the language-model reward for the $k$th sampled sentence as $r_{1,k} = LM_B(s_{mid,k})$.\n9:     Set the communication reward for the $k$th sampled sentence as $r_{2,k} = \\log P(s|s_{mid,k}; \\Theta_{BA})$.\n10:     Set the total reward of the $k$th sample as $r_k = \\alpha r_{1,k} + (1 - \\alpha) r_{2,k}$.\n11:   end for\n12:   Compute the stochastic gradient of $\\Theta_{AB}$: $\\nabla_{\\Theta_{AB}} \\hat{E}[r] = \\frac{1}{K} \\sum_{k=1}^{K} [r_k \\nabla_{\\Theta_{AB}} \\log P(s_{mid,k}|s; \\Theta_{AB})]$.\n13:   Compute the stochastic gradient of $\\Theta_{BA}$: $\\nabla_{\\Theta_{BA}} \\hat{E}[r] = \\frac{1}{K} \\sum_{k=1}^{K} [(1 - \\alpha) \\nabla_{\\Theta_{BA}} \\log P(s|s_{mid,k}; \\Theta_{BA})]$.\n14:   Model updates: $\\Theta_{AB} \\leftarrow \\Theta_{AB} + \\gamma_{1,t} \\nabla_{\\Theta_{AB}} \\hat{E}[r]$, $\\Theta_{BA} \\leftarrow \\Theta_{BA} + \\gamma_{2,t} \\nabla_{\\Theta_{BA}} \\hat{E}[r]$.\n15:   Set $s = s_B$.   \u25b7 Model update for the game beginning from B.\n16:   Go through line 6 to line 14 symmetrically.\n17: until convergence\n\nAssume we already have two well-trained language models $LM_A(.)$ and $LM_B(.)$ (which are easy to\nobtain since they only require monolingual data), each of which takes a sentence as input and outputs\na real value to indicate how confident the sentence is a natural sentence in its own language. 
Here the\nlanguage models can be trained either using other resources, or just using the monolingual data $D_A$\nand $D_B$.\nFor a game beginning with sentence $s$ in $D_A$, denote $s_{mid}$ as the middle translation output. This\nmiddle step has an immediate reward $r_1 = LM_B(s_{mid})$, indicating how natural the output sentence\nis in language B. Given the middle translation output $s_{mid}$, we use the log probability of $s$ recovered\nfrom $s_{mid}$ as the reward of the communication (we will use reconstruction and communication\ninterchangeably). Mathematically, reward $r_2 = \\log P(s|s_{mid}; \\Theta_{BA})$.\nWe simply adopt a linear combination of the LM reward and communication reward as the total\nreward, e.g., $r = \\alpha r_1 + (1 - \\alpha) r_2$, where $\\alpha$ is a hyper-parameter. As the reward of the game can\nbe considered as a function of $s$, $s_{mid}$ and translation models $\\Theta_{AB}$ and $\\Theta_{BA}$, we can optimize the\nparameters in the translation models through policy gradient methods for reward maximization, as\nwidely used in reinforcement learning [13].\nWe sample $s_{mid}$ according to the translation model $P(.|s; \\Theta_{AB})$. Then we compute the gradient of\nthe expected reward $E[r]$ with respect to parameters $\\Theta_{AB}$ and $\\Theta_{BA}$. According to the policy gradient\ntheorem [13], it is easy to verify that\n\n$$\\nabla_{\\Theta_{BA}} E[r] = E[(1 - \\alpha) \\nabla_{\\Theta_{BA}} \\log P(s|s_{mid}; \\Theta_{BA})] \\qquad (6)$$\n$$\\nabla_{\\Theta_{AB}} E[r] = E[r \\nabla_{\\Theta_{AB}} \\log P(s_{mid}|s; \\Theta_{AB})] \\qquad (7)$$\n\nin which the expectation is taken over $s_{mid}$.\n
Based on Eqn.(6) and (7), we can adopt any sampling approach to estimate the expected gradient.\nConsidering that random sampling brings very large variance and sometimes unreasonable results in\nmachine translation [9, 12, 10], we use beam search [12] to obtain more meaningful results (more\nreasonable middle translation outputs) for gradient computation, i.e., we greedily generate top-$K$\nhigh-probability middle translation outputs, and use the averaged value on the beam search results\nto approximate the true gradient. If the game begins with sentence $s$ in $D_B$, the computation of the\ngradient is just symmetric and we omit it here.\nThe game can be repeated for many rounds. In each round, one sentence is sampled from $D_A$ and\none from $D_B$, and we update the two models according to the game beginning with the two sentences\nrespectively. The details of this process are given in Algorithm 1.\n\nTable 1: Translation results of the En\u2194Fr task. The results of the experiments using all the parallel data\nfor training are provided in the first two columns (marked by \u201cLarge\u201d), and the results using 10%\nparallel data for training are in the last two columns (marked by \u201cSmall\u201d).\nModel | En\u2192Fr (Large) | Fr\u2192En (Large) | En\u2192Fr (Small) | Fr\u2192En (Small)\nNMT | 29.92 | 27.49 | 25.32 | 22.27\npseudo-NMT | 30.40 | 27.66 | 25.63 | 23.24\ndual-NMT | 32.06 | 29.78 | 28.73 | 27.50\n\n4 Experiments\n\nWe conducted a set of experiments to test the proposed dual-learning mechanism for neural machine\ntranslation.\n\n4.1 Settings\n\nWe compared our dual-NMT approach with two baselines: the standard neural machine translation\n[1] (NMT for short), and a recent NMT-based method [11] which generates pseudo bilingual sentence\npairs from monolingual corpora to assist training (pseudo-NMT for short). We leverage a tutorial\nNMT system implemented by Theano for all the experiments. 
(see footnote 2)\nWe evaluated our algorithm on the translation task of a pair of languages: English\u2192French (En\u2192Fr)\nand French\u2192English (Fr\u2192En).\nIn detail, we used the same bilingual corpora from WMT\u201914\nas used in [1, 5], which contain 12M sentence pairs extracted from five datasets: Europarl v7,\nCommon Crawl corpus, UN corpus, News Commentary, and $10^9$ French-English corpus. Following\ncommon practice, we concatenated newstest2012 and newstest2013 as the validation set, and used\nnewstest2014 as the testing set. We used the \u201cNews Crawl: articles from 2012\u201d provided by WMT\u201914\nas monolingual data.\nWe used the GRU networks and followed the practice in [1] to set experimental parameters. For each\nlanguage, we constructed the vocabulary with the most common 30K words in the parallel corpora,\nand out-of-vocabulary words were replaced with a special token <UNK>. For monolingual corpora,\nwe removed the sentences containing at least one out-of-vocabulary word. Each word was projected\ninto a continuous vector space of 620 dimensions, and the dimension of the recurrent unit was 1000.\nWe removed sentences with more than 50 words from the training set. Batch size was set as 80 with\n20 batches pre-fetched and sorted by sentence lengths.\nFor the baseline NMT model, we exactly followed the settings reported in [1]. For the baseline\npseudo-NMT [11], we used the trained NMT model to generate pseudo bilingual sentence pairs from\nmonolingual data, removed the sentences with more than 50 words, merged the generated data with\nthe original parallel training data, and then trained the model for testing. Each of the baseline models\nwas trained with AdaDelta [15] on a K40m GPU until its performance stopped improving on the\nvalidation set.\nOur method needs a language model for each language. We trained an RNN based language model\n[7] for each language using its corresponding monolingual corpus. 
Then the language model was\nfixed and the log likelihood of a received message was used to reward the communication channel\n(i.e., the translation model) in our experiments.\n\nFootnote 2: dl4mt-tutorial: https://github.com/nyu-dl\n\nTable 2: Reconstruction performance of the En\u2194Fr task\nModel | En\u2192Fr\u2192En (L) | Fr\u2192En\u2192Fr (L) | En\u2192Fr\u2192En (S) | Fr\u2192En\u2192Fr (S)\nNMT | 39.92 | 45.05 | 28.28 | 32.63\npseudo-NMT | 38.15 | 45.41 | 30.07 | 34.54\ndual-NMT | 51.84 | 54.65 | 48.94 | 50.38\n\nWhile playing the game, we initialized the channels using warm-start translation models (e.g., trained\nfrom bilingual corpora), to see whether dual-NMT can effectively improve the machine\ntranslation accuracy. In our experiments, in order to transition smoothly from the initial model trained\nfrom bilingual data to the model trained purely from monolingual data, we adopted the following\nsoft-landing strategy. At the very beginning of the dual learning process, for each mini batch, we\nused half of the sentences from monolingual data and half from bilingual data (sampled from\nthe dataset used to train the initial model). The objective was to maximize the weighted sum of the\nreward based on monolingual data defined in Section 3 and the likelihood on bilingual data defined in\nSection 2. As the training process went on, we gradually increased the percentage of monolingual\nsentences in the mini batch, until no bilingual data were used at all. Specifically, we tested two\nsettings in our experiments:\n\n\u2022 In the first setting (referred to as Large), we used all the 12M bilingual sentence pairs during\nthe soft-landing process. 
That is, the warm start model was learnt based on full bilingual\ndata.\n\n\u2022 In the second setting (referred to as Small), we randomly sampled 10% of the 12M bilingual\nsentence pairs and used them during the soft-landing process.\n\nFor each of the settings we trained our dual-NMT algorithm for one week. We set the beam search\nsize to be 2 in the middle translation process. All the hyperparameters in the experiments were set by\ncross validation. We used the BLEU score [8] as the evaluation metric, which is computed by the\nmulti-bleu.perl script (see footnote 3). Following the common practice, during testing we used beam search [12]\nwith beam size of 12 for all the algorithms as in many previous works.\n\n4.2 Results and Analysis\nWe report the experimental results in this section. Recall that the two baselines for English\u2192French\nand French\u2192English are trained separately while our dual-NMT conducts joint training. We summarize\nthe overall performances in Table 1 and plot the BLEU scores with respect to the length of\nsource sentences in Figure 1.\nFrom Table 1 we can see that our dual-NMT algorithm outperforms the baseline algorithms in all\nthe settings. For the translation from English to French, dual-NMT outperforms the baseline NMT\nby about 2.1/3.4 points for the first/second warm start setting, and outperforms pseudo-NMT by\nabout 1.7/3.1 points for the first/second setting. For the translation from French to English, the improvement is\nmore significant: our dual-NMT outperforms NMT by about 2.3/5.2 points for the first/second warm\nstart setting, and outperforms pseudo-NMT by about 2.1/4.3 points for the first/second setting. Surprisingly,\nwith only 10% bilingual data, dual-NMT achieves translation accuracy comparable to vanilla NMT\nusing 100% bilingual data for the Fr\u2192En task. These results demonstrate the effectiveness of our\ndual-NMT algorithm. 
Furthermore, we have the following observations:\n\n\u2022 Although pseudo-NMT outperforms NMT, its improvements are not very significant. Our\nhypothesis is that the quality of pseudo bilingual sentence pairs generated from the monolingual\ndata is not very good, which limits the performance gain of pseudo-NMT. One might\nneed to carefully select and filter the generated pseudo bilingual sentence pairs to get better\nperformance for pseudo-NMT.\n\n\u2022 When the parallel bilingual data are small, dual-NMT makes a larger improvement. This\nshows that the dual-learning mechanism makes very good use of monolingual data.\nThus we expect dual-NMT will be more helpful for language pairs with smaller labeled\nparallel data. Dual-NMT opens a new window to learn to translate from scratch.\n\nFootnote 3: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl\n\nTable 3: Case study of the translation-back-translation (TBT) performance during dual-NMT training\n\nSource (En): The majority of the growth in the years to come will come from its liquefied natural gas schemes in Australia.\nEn\u2192Fr, before dual-NMT training: La plus grande partie de la croissance des ann\u00e9es \u00e0 venir viendra de ses syst\u00e8mes de gaz naturel liqu\u00e9fi\u00e9 en Australie .\nEn\u2192Fr, after dual-NMT training: La majorit\u00e9 de la croissance dans les ann\u00e9es \u00e0 venir viendra de ses r\u00e9gimes de gaz naturel liqu\u00e9fi\u00e9 en Australie .\nEn\u2192Fr\u2192En, before dual-NMT training: Most of the growth of future years will come from its liquefied natural gas systems in Australia .\nEn\u2192Fr\u2192En, after dual-NMT training: The majority of growth in the coming years will come from its liquefied natural gas systems in Australia .\n\nSource (Fr): Il pr\u00e9cise que \" les deux cas identifi\u00e9s en mai 2013 restent donc les deux seuls cas confirm\u00e9s en France \u00e0 ce jour \" .\nFr\u2192En, before dual-NMT training: He noted that \" the two cases identified in May 2013 therefore remain the only two two confirmed cases in France to date \" .\nFr\u2192En, after dual-NMT training: He states that \" the two cases identified in May 2013 remain the only two confirmed cases in France to date \"\nFr\u2192En\u2192Fr, before dual-NMT training: Il a not\u00e9 que \" les deux cas identifi\u00e9s en mai 2013 demeurent donc les deux seuls deux deux cas confirm\u00e9s en France \u00e0 ce jour \"\nFr\u2192En\u2192Fr, after dual-NMT training: Il pr\u00e9cise que \" les deux cas identifi\u00e9s en mai 2013 restent les seuls deux cas confirm\u00e9s en France \u00e0 ce jour \".\n\n
We plot BLEU scores with respect to the length of source sentences in Figure 1. From the figure, we\ncan see that our dual-NMT algorithm outperforms the baseline algorithms in all the ranges of length.\nWe study our dual-NMT algorithm in more depth in Table 2. We study the self-reconstruction\nperformance of the algorithms: For each sentence in the test set, we translated it back and forth using\nthe models and then checked how close the back translated sentence is to the original sentence using\nthe BLEU score. We also used beam search to generate all the translation results. It can be easily\nseen from Table 2 that the self-reconstruction BLEU scores of our dual-NMT are much higher than\nthose of NMT and pseudo-NMT. In particular, our proposed method outperforms NMT by about 11.9/9.6\npoints when using the warm-start model trained on large parallel data, and outperforms NMT by about\n20.7/17.8 points when using the warm-start model trained on 10% parallel data.\nWe list several example sentences in Table 3 to compare the self-reconstruction results of models\nbefore and after dual learning. 
It is quite clear that after dual learning, the reconstruction is largely\nimproved for both directions, i.e., English\u2192French\u2192English and French\u2192English\u2192French.\nTo summarize, all the results show that the dual-learning mechanism is promising and better utilizes\nthe monolingual data.\n\n5 Extensions\n\nIn this section, we discuss the possible extensions of our proposed dual learning mechanism.\n\nFirst, although we have focused on machine translation in this work, the basic idea of dual learning is\ngenerally applicable: as long as two tasks are in dual form, we can apply the dual-learning mechanism\nto simultaneously learn both tasks from unlabeled data using reinforcement learning algorithms.\nActually, many AI tasks are naturally in dual form, for example, speech recognition versus text\nto speech, image captioning versus image generation, question answering versus question generation\n(e.g., Jeopardy!), search (matching queries to documents) versus keyword extraction (extracting\nkeywords/queries for documents), and so on. It would be very interesting to design and test\ndual-learning algorithms for more dual tasks beyond machine translation.\nSecond, although we have focused on dual learning on two tasks, our technology is not restricted to\ntwo tasks only. Actually, our key idea is to form a closed loop so that we can extract feedback signals\nby comparing the original input data with the final output data. Therefore, if more than two associated\ntasks can form a closed loop, we can apply our technology to improve the model in each task from\nunlabeled data. For example, for an English sentence x, we can first translate it to a Chinese sentence\ny, then translate y to a French sentence z, and finally translate z back to an English sentence x'. 
The\nsimilarity between x and x' can indicate the effectiveness of the three translation models in the loop,\nand we can once again apply the policy gradient methods to update and improve these models based\non the feedback signals during the loop. We would like to name this generalized dual learning as\nclosed-loop learning, and will test its effectiveness in the future.\n\nFigure 1: BLEU scores w.r.t. lengths of source sentences. Panels: (a) En\u2192Fr, (b) Fr\u2192En. Axes: Source Sentence Length (bins <10, [10,20), [20,30), [30,40), [40,50), [50,60), >60) versus BLEU. Curves: NMT (Large), dual-NMT (Large), NMT (Small), dual-NMT (Small).\n\n6 Future Work\n\nWe plan to explore the following directions in the future. First, in the experiments we used bilingual\ndata to warm start the training of dual-NMT. A more exciting direction is to learn from scratch, i.e.,\nto learn translations directly from monolingual data of two languages (possibly plus a lexical dictionary).\nSecond, our dual-NMT was based on NMT systems in this work. Our basic idea can also be applied\nto phrase-based SMT systems and we will look into this direction. Third, we only considered a pair\nof languages in this paper. We will extend our approach to jointly train multiple translation models\nfor a tuple of 3+ languages using monolingual data.\n\nAcknowledgement\n\nThis work was partially supported by National Basic Research Program of China (973 Program)\n(grant no. 2015CB352502), NSFC (61573026) and the MOE\u2013Microsoft Key Laboratory of Statistics\nand Machine Learning, Peking University. We would like to thank Yiren Wang, Fei Tian, Li Zhao\nand Wei Chen for helpful discussions, and the anonymous reviewers for their valuable comments on\nour paper.\n\nReferences\n[1] D. Bahdanau, K. Cho, and Y. Bengio. 
Neural machine translation by jointly learning to align\nand translate. ICLR, 2015.\n\n[2] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean. Large language models in machine\ntranslation. In Proceedings of the Joint Conference on Empirical Methods in Natural\nLanguage Processing and Computational Natural Language Learning. Citeseer, 2007.\n\n[3] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and\nY. Bengio. Learning phrase representations using RNN encoder\u2013decoder for statistical machine\ntranslation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language\nProcessing (EMNLP), pages 1724\u20131734, Doha, Qatar, October 2014. Association for\nComputational Linguistics.\n\n[4] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin, F. Bougares, H. Schwenk, and\nY. Bengio. On using monolingual corpora in neural machine translation. arXiv preprint\narXiv:1503.03535, 2015.\n\n[5] S. Jean, K. Cho, R. Memisevic, and Y. Bengio. On using very large target vocabulary for\nneural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for\nComputational Linguistics and the 7th International Joint Conference on Natural Language\nProcessing (Volume 1: Long Papers), pages 1\u201310, Beijing, China, July 2015. Association for\nComputational Linguistics.\n\n[6] P. Koehn, F. J. Och, and D. Marcu. Statistical phrase-based translation. In Proceedings of the\n2003 Conference of the North American Chapter of the Association for Computational Linguistics\non Human Language Technology-Volume 1, pages 48\u201354. Association for Computational\nLinguistics, 2003.\n\n[7] T. Mikolov, M. Karafi\u00e1t, L. Burget, J. \u010cernock\u00fd, and S. Khudanpur. Recurrent neural network\nbased language model. In INTERSPEECH, volume 2, page 3, 2010.\n\n[8] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of\nmachine translation. 
In Proceedings of the 40th Annual Meeting of the Association for Computational\nLinguistics, pages 311\u2013318. Association for Computational Linguistics, 2002.\n\n[9] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural\nnetworks. arXiv preprint arXiv:1511.06732, 2015.\n\n[10] A. M. Rush, S. Chopra, and J. Weston. A neural attention model for abstractive sentence\nsummarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural\nLanguage Processing, pages 379\u2013389, Lisbon, Portugal, September 2015. Association for\nComputational Linguistics.\n\n[11] R. Sennrich, B. Haddow, and A. Birch. Improving neural machine translation models with\nmonolingual data. In ACL, 2016.\n\n[12] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In\nAdvances in Neural Information Processing Systems, pages 3104\u20133112, 2014.\n\n[13] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour, et al. Policy gradient methods for\nreinforcement learning with function approximation. In NIPS, volume 99, pages 1057\u20131063,\n1999.\n\n[14] N. Ueffing, G. Haffari, and A. Sarkar. Semi-supervised model adaptation for statistical machine\ntranslation. Machine Translation Journal, 2008.\n\n[15] M. D. Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701,\n2012.", "award": [], "sourceid": 501, "authors": [{"given_name": "Di", "family_name": "He", "institution": "Microsoft"}, {"given_name": "Yingce", "family_name": "Xia", "institution": "USTC"}, {"given_name": "Tao", "family_name": "Qin", "institution": "Microsoft"}, {"given_name": "Liwei", "family_name": "Wang", "institution": "Peking University"}, {"given_name": "Nenghai", "family_name": "Yu", "institution": "USTC"}, {"given_name": "Tie-Yan", "family_name": "Liu", "institution": "Microsoft"}, {"given_name": "Wei-Ying", "family_name": "Ma", "institution": "Microsoft"}]}