NeurIPS 2020

Data Diversification: A Simple Strategy For Neural Machine Translation


Review 1

Summary and Contributions: This paper investigates data diversification for neural machine translation. The authors augment the parallel training corpus (x, y) by building forward-translation data (x, y') and back-translation data (x', y) from the source-side data x and the target-side data y. The proposed method achieves state-of-the-art performance on the WMT'14 English-German and English-French translation tasks, and substantially improves on eight other translation tasks.

Strengths: The authors showed that even parallel corpora can be useful for data augmentation, whereas existing work has focused on monolingual data. The idea is straightforward, and the proposed method gives a +1-2 BLEU improvement on many translation tasks. The analysis section provides some evidence of why the proposed approach works.

Weaknesses:
- In Table 2, what about the performance of the vanilla Transformer with the proposed approach? It would be clearer to report the baseline plus the proposed approach, rather than only aiming at state-of-the-art performance.
- In Figure 1, the reported perplexities are over 30, which looks quite high. In my experience, such high perplexity is at odds with better BLEU scores. How did you calculate perplexity?
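For reference, corpus-level perplexity is conventionally the exponential of the mean negative log-likelihood per token; a minimal sketch, where the per-token probabilities are made up purely for illustration:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Hypothetical per-token probabilities assigned by a model.
log_probs = [math.log(p) for p in [0.25, 0.10, 0.50, 0.05]]
ppl = perplexity(log_probs)  # geometric mean of the inverse token probabilities
```

Whether perplexity is computed over word-level or subword-level tokens, and whether the loss is label-smoothed, changes its scale considerably, which may account for the discrepancy the reviewer observes.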

Correctness:
- In Table 9, the authors use BT data only from News Crawl 2009, but more recent monolingual data are available, such as News Crawl 2010-2014. Why didn't you use those data sets? One of the positive effects of BT data is to introduce new training examples with noisy inputs and natural human outputs, so intuitively the decoder is updated with diverse topics and data. Why do you think the proposed diversification strategy provides improvements similar to BT data?
- Do you think this approach has a regularization effect through the data augmented from the parallel corpus?
- How did you create the vocabulary? Is it separate or joint?

Clarity: Tables are located far from where they are discussed in the text, e.g. Table 6 appears on page 6 but is mentioned on page 8. Please fix this placement as much as possible for readability. Typo: l. 247 "and and learns" -> "and learns".

Relation to Prior Work: None.

Reproducibility: Yes

Additional Feedback: UPDATE: Thank you for providing the author response. I am not well convinced by the experimental design or by the conclusion that the method complements BT. I decreased my score from 6 to 5.


Review 2

Summary and Contributions: The paper proposes a data augmentation technique for neural machine translation that trains multiple models on a dataset and harvests the ensembling effect of these models via data diversification. The proposed method starts by training multiple left-to-right and right-to-left models on the supervised data at hand, then goes into an iterative phase of augmenting more and more data using the previous models, and finally trains a last left-to-right model. The loop variable "k" determines the diversification factor by setting how many iterations are completed to augment the final dataset. The paper relates the proposed approach to data augmentation, back-translation, ensemble methods, knowledge distillation, and multi-agent dual learning; it then provides empirical evidence on multiple datasets covering high-, mid-, and low-resource MT tasks, and closes with an analysis section laying out the factors that play a role in the proposed approach.
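The outer loop summarized above can be sketched as follows; `train_forward` and `train_backward` are hypothetical stand-ins for training source-to-target and target-to-source NMT models, so this illustrates only the data flow, not the authors' actual implementation:

```python
def diversify(corpus, train_forward, train_backward, k=1, n=3):
    """Data diversification sketch: per round, train n forward and n backward
    models on the current data, translate the original corpus with each,
    and merge the synthetic pairs into the training data."""
    data = list(corpus)  # original (src, tgt) pairs
    for _ in range(k):   # k = diversification factor
        for _ in range(n):
            fwd = train_forward(data)   # returns a src -> tgt' translator
            bwd = train_backward(data)  # returns a tgt -> src' translator
            data += [(src, fwd(src)) for src, _ in corpus]  # forward data
            data += [(bwd(tgt), tgt) for _, tgt in corpus]  # backward data
    return data  # the final left-to-right model is trained on this

# Toy usage with string transforms standing in for trained models.
corpus = [("hello", "bonjour"), ("world", "monde")]
augmented = diversify(corpus,
                      train_forward=lambda d: (lambda s: s.upper()),
                      train_backward=lambda d: (lambda t: t[::-1]),
                      k=1, n=2)
# len(augmented) == 2 + 2 * (2 + 2) == 10
```

The sketch also makes the reviewers' cost concern concrete: each round requires training 2n models and running 2n full decoding passes over the corpus.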

Strengths: - Simple approach to implement as an outer loop using any NMT architecture/variant.

Weaknesses:
- Lack of diversification details (hinted at in the last section, but not mentioned explicitly).
- Ignores the translationese effect of forward/backward translation on BLEU scores.
- Although the proposed approach is advertised as a cost-efficient solution, the inference cost, which is expected to be substantial, is not mentioned in the paper.

Correctness:
1. Without separating out the translationese effect [1,2], the provided BLEU scores are hard to interpret. Backward-and-forward translation changes (simplifies) the generated output, which is then used to train the subsequent models. If, at the end, these models are evaluated on test sets that are not natural but translationese (which is also simpler), this inflates the BLEU scores.
2. In general the BLEU improvements are very minor, especially in Table 5, where they are sometimes 0.1 BLEU. In addition, Transformer models are known to have high variance (+-0.5 BLEU) [3]; if the proposed approach does not use checkpoint averaging or a similar smoothing method, the improvements shown in the paper could be spurious.
3. The formulation in Section 5.1 ignores the effect of optimization and assumes that the models fit the available data perfectly.
[1] https://arxiv.org/abs/1908.05204
[2] https://arxiv.org/abs/1911.03823
[3] https://arxiv.org/abs/1804.09849
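Checkpoint averaging, the smoothing method raised in point 2, simply averages saved parameter values element-wise over the last few checkpoints; a minimal sketch over plain parameter dicts, with names and shapes that are purely illustrative:

```python
def average_checkpoints(checkpoints):
    """Element-wise average of a list of {name: [values]} parameter dicts."""
    n = len(checkpoints)
    return {name: [sum(vals) / n
                   for vals in zip(*(c[name] for c in checkpoints))]
            for name in checkpoints[0]}

# Two toy "checkpoints", each holding a single weight vector.
ck1 = {"w": [1.0, 2.0]}
ck2 = {"w": [3.0, 4.0]}
avg = average_checkpoints([ck1, ck2])  # {"w": [2.0, 3.0]}
```

Averaging the last few checkpoints typically reduces run-to-run BLEU variance, which is why its presence or absence matters when interpreting differences as small as 0.1 BLEU.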

Clarity: Paper is clear, easy to read and digest.

Relation to Prior Work: Consider covering the literature on "self-training" [1], which resembles the proposed approach. A similar data diversification approach appears in [2]. For the model architectures, please consider mentioning hybrid approaches [3].
[1] https://openreview.net/pdf?id=SJgdnAVKDH
[2] https://arxiv.org/abs/1909.11861
[3] https://arxiv.org/abs/1804.09849

Reproducibility: Yes

Additional Feedback:


Review 3

Summary and Contributions: The paper boosts the performance of NMT with a data diversification method. It uses forward and backward models to generate samples and merges them with the original dataset. The method achieves state-of-the-art BLEU scores on the WMT'14 English-German and English-French translation tasks.

Strengths: The proposed method combines back-translation, model ensembling, data augmentation, and knowledge distillation for NMT. The paper also gives a detailed analysis of data diversification and conducts a study on hyperparameters and back-translation.

Weaknesses: The novelty of this paper is limited. All the strategies, including back-translation, are common methods in NMT. This paper only combines them to achieve high performance, without proposing a novel model or architecture. It may, however, be a very good method for participating in the WMT competition.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Overall, the paper is well written, and the model structure and training details are clearly presented. It also gives a detailed analysis of data diversification and conducts a study on hyperparameters and back-translation. However, all the strategies presented in the paper are common methods in NMT, including back-translation and dual learning. In my view, the novelty of this paper has not reached an acceptable level for the NeurIPS conference.
There are some missing references:
1. Achieving Human Parity on Automatic Chinese to English News Translation. Hassan et al. 2018.
2. Iterative Back-Translation for Neural Machine Translation. Hoang et al. 2018.
3. Joint Training for Neural Machine Translation Models with Monolingual Data. Zhang et al. 2018.
4. Regularizing Neural Machine Translation by Target-Bidirectional Agreement. Zhang et al. 2019.
5. Synchronous Bidirectional Neural Machine Translation. Zhou et al. 2019.
Questions:
1. The proposed method could actually be improved with a right-to-left NMT model or a bidirectional NMT model, as shown in Hassan et al. (2018). Have you combined the proposed method with right-to-left or bidirectional NMT models? What is the main difference between your method and Hassan et al. (2018)?


Review 4

Summary and Contributions: This work describes a simple approach to synthetically augment the training data for neural machine translation. The proposed approach involves training multiple forward and backward MT models and appending their translations of the original training set to the training data. This augmented (or diversified) training dataset can then be used to train the next generation of models.
1. The proposed approach is evaluated on the WMT'14 En<->Fr and En<->De, IWSLT En<->Fr and En<->De, and Flores En<->Ne and En<->Si tasks, covering a wide range of resource sizes. The approach seems to improve by >1 BLEU over baselines trained directly on the natural dataset in most experiments.
2. The approach is compared against ensembles and baselines trained without data diversification.
3. The approach is shown to improve further when combined with back-translation on additional monolingual data.
4. Additional analysis demonstrates the effect of the approach on the model's perplexity, with ablations studying the effect of varying the initial parameters of the intermediate models and the effect of including forward-translated data on the final model quality.
Edit after author response: I would still have liked to see human evaluations to validate the BLEU gains, given the extensive use of synthetic data in this paper, but given the additional analysis on translationese I am updating my score to 6.

Strengths: 1. The described approach is simple, independent of the underlying model architecture, and applicable to any NMT task as long as an initial bilingual training set is available. The observed improvement in test BLEU is significant. 2. Careful study of various factors, such as controlling for parameter initialization, forward translation, and perplexity, to understand the effect of the approach on model training.

Weaknesses: While the described approach is simple and very generally applicable, there are some major issues with the evaluation that need to be addressed. If 1. and 2. are addressed I would be willing to update my scores.
1. The BLEU evaluation is not clearly described for the WMT and IWSLT experiments. Given the major variations in BLEU scores caused by differences in post-processing or in the BLEU evaluation script used, it is hard to compare fairly against previous work without clearly describing the post-processing, tokenization, and BLEU evaluation tool used for these experiments. [1]
2. When training with synthetic data, BLEU scores are an unreliable measure of translation quality due to the translationese effects present in several standard test sets [2, 3, and several follow-up works]. Since the proposed method relies heavily on backward- and forward-translated data, these effects are bound to affect the observed BLEU improvements. A careful study of the effect of this approach on the forward- and backward-translated subsets of the evaluation sets, ideally with translation quality evaluated by human raters on the two subsets, should address this concern.
Other minor concerns:
3. Table 1 suggests that the amount of "actual" training data required for back-translation and NMT+BERT is much higher. This seems misleading given that those approaches rely on unlabeled data, which is readily available.
4. Missing reference and discussion on self-training. [4]
5. Additional analysis on out-of-domain generalization (or out-of-domain test sets) would be nice to have. Is augmentation with smooth synthetic data affecting the generalization ability of the model beyond the test sets being used?
References:
[1] A Call for Clarity in Reporting BLEU Scores, Post et al.
[2] APE at Scale and Its Implications on MT Evaluation Biases, Freitag et al.
[3] On the Evaluation of Machine Translation Systems Trained with Back-Translation, Edunov et al.
[4] Revisiting Self-Training for Neural Sequence Generation, He et al.

Correctness: Yes, no major issues with the method except for the evaluation methodology discussed above.

Clarity: Yes, the paper is easy to read and understand.

Relation to Prior Work: Some missing discussion and references on self-training, but otherwise adequate.

Reproducibility: Yes

Additional Feedback: