NIPS 2018
Sun Dec 2nd through Sat the 8th, 2018 at Palais des Congrès de Montréal
Paper ID: 3001
Title: MacNet: Transferring Knowledge from Machine Comprehension to Sequence-to-Sequence Models

Reviewer 1

Update after Author Feedback: After reading all the reviews and the author feedback, I have two overall comments. First, the paper is branded as a transfer learning paper, but I'm left disappointed in this respect. I find it very surprising that the attention can be transferred at all, but it is such a small contribution to the MacNet architecture's overall improvements that it seems a hard sell. Focal losses have been used before and encoders have been transferred before, but they also contribute to the performance improvements. Second, the ablations on summarization are necessary for a camera-ready version -- that seems like a hole right now, so I hope they are included in future versions. Overall, I'm still a 6 because you find a combination of things (with some surprising novelty) that improves performance, and it has shown me that I should experiment with those things in the future.

Original Review: Recent work (McCann et al 2017) had shown that the context vectors (CoVe) of LSTMs trained as encoders for machine translation contained useful information for other tasks. The authors of this work show that CoVe from question answering encoders can similarly be used to augment performance on other tasks (Eq. 5). More surprisingly, and perhaps more relevant to the community, they also show that very specific higher layers of NLP models trained on one task can be transferred successfully to perform similar functions for other tasks (Eq. 12). This transfer is from question answering (machine comprehension) to both machine translation and summarization, which are substantial tasks, and experiments show that this transfer improves performance on both. They also demonstrate the increased effectiveness of using a focal-loss-inspired loss function (Eq. 15).

Pros: While reading Section 4 (MacNet Architecture), I found myself taking notes on the kinds of ablations I would want to see, and I was glad to find most of these ablations in Section 5. 
The information conveyed by the writing was clear. MacNet as a package seems to clearly be superior to not using MacNet.

Not really a pro or a con: I spent a while puzzling over the transfer of G_{enc}. It just isn’t clear to me why the authors would not use the same kinds of inputs G_{enc} was trained with, and the authors don’t justify why transferring by concatenation at the point they chose is better than using those representations as additional embeddings.

Cons: I worry a bit that the paper focuses mostly on discussion of transfer learning, but then nearly as much of the gain in performance for NMT comes from the alternative loss function in Eq. 15. In summarization, the performance gains are about 1 point across the board, but again, is this really just because of the alternative loss function? I would prefer to see the NMT ablations repeated for summarization as well. The second biggest contributor to gains seems to be the Encoding layer for NMT, but this is the transfer that we shouldn’t be surprised would work given the prior work in this direction. That the Modeling layer helps at all is surprising, but on its own it gives a gain of at most .4 on NMT; it is not clear how many runs it took to get that .4, so it seems possible that the transfer of the modeling layer doesn’t help much at all for most training runs.

Overall: I’m left somewhat skeptical that any of the three contributions on their own are really very significant, but all together as MacNet, they work, there is one surprise, and I learned that I should experiment with a new loss for sequence generation tasks. I tend to think this makes for a good submission, but I’m skeptical about the significance of the transfer learning of the modeling layer. Given that this seems to be the biggest novelty to me, I’m a bit hesitant about the overall contribution as well, which leads me to give it a 6. 
I’ve included some more detailed comments below with some additional suggestions and clarification questions. Given that the empirical results aren’t there yet to clear up some of my concerns, my confidence is 4, but I look forward to hearing from the authors during the rebuttal phase.

Combinations: “We use a simple MLP classifier on the combination of \{m_j\}_{j=1}^{n} and G to locate the start and end positions of the answer.” (line 13) What is the combination here? Concatenation? Summation? Something else?

Suggested citation: Question Answering through Transfer Learning from Large Fine-grained Supervision Data (Min et al. 2017). In that work, they show that pretraining on question answering datasets like SQuAD can benefit downstream tasks like SNLI. It’s clearly very different from your work in key ways, but it is related in that it shows how QA models trained on SQuAD can transfer well to other tasks.

Citation spacing: It looks like all of your citations would be better off with a space between them and the preceding text. Try using ~\cite instead of \cite (or the corresponding ~\citep and ~\citet).

Pointer-Generator Comparison: Are the examples in Table 5 randomly chosen? This would be worth mentioning. Otherwise, many will assume you’ve cherry-picked examples that make your model look better. If that is what you did, well… I would suggest randomly choosing them to be more fair.

Transparency in Equations: In Eq. 2, you use f_{att} and cite Seo et al. 2017, but it would be nice to include those equations in an appendix so that readers don’t have to go look up another paper just to see how your model is actually implemented.

Embeddings, tokenization, etc.: Character-level embeddings are obtained by training a CNN over the characters of each word (lines 111-112). G_{enc} is then trained on GloVe and character embeddings from that CNN, which are summarized as f_{rep} (Eq. 1). Then in Eq. 5, G_{enc} is used to take in Emb(x_s). 
The authors say that a fully connected layer transforms Emb(x_s) to the right size for G_{enc}, but this is left out of Eq. 5. It should be included there to avoid confusion. This fully connected transformation is confusing because for NMT, unlike for MC, the data is split into subword units, so it is not clear that G_{enc} would have even seen many of the newly tokenized forms of the data. Why not use f_{rep} again for the inputs? How many subword units were used? Enough to make sure that G_{enc} still sees mostly words that it saw during MC training? Similar questions arise again for the Pointer-Generator + MacNet. It seems like that fully connected layer has to learn the transformation from Emb(x_s) to f_{rep}, in which case, why have f_{rep} at all if Emb + fully connected can do that, and, to reiterate, why not just use f_{rep} instead of learning it?

Point of transfer for G_{enc}: What if you just used \tilde e_s (Eq. 5) as additional embeddings the way that CoVe suggested, instead of concatenating with the other encoder?

Contribution of alternative loss: It seems like Baseline + MacNet differs from Baseline + Encoding Layer + Modeling Layer only in the use of Eq. 15 in the former. Is this correct? If so, then this seems to be almost as important as the actual transfer learning in your final performance gains over the Baseline in Table 2. How much gain comes from just switching to that kind of loss? And how does this kind of ablation work with summarization? Is it all about the loss in that case?

I'm reserving a "very confident" rating on reproducibility for code that is going to be open sourced. Anything with transfer learning and big tasks like NMT and summarization can be finicky, so I'm sure people are bound to run into problems. The details seem to be in the paper, so that wouldn't be the headache.

Reviewer 2

This paper studies whether a pre-trained machine comprehension model can transfer to sequence-to-sequence models for machine translation and abstractive text summarization. The knowledge transfer is achieved by 1) augmenting the encoder with additional word representations which re-use pre-trained RNN weights from machine comprehension; 2) adding the RNN weights from the MC’s modeling layer to the decoder.

Overall, I think the presented ideas could be somewhat interesting, but the current results are not convincing enough. The authors failed to give a clear explanation of why this knowledge transfer should work and what the main intuition behind it is. According to Table 2, sharing the encoding layer gives ~1 point gain, while sharing the modeling layer really helps very little. The focal loss seems to help 0.7 ~ 0.9 points; however, this gain shouldn’t be merged with the other two components, as the main theme of this paper is knowledge transfer from machine comprehension. Therefore, the improvement from re-using the pre-trained weights is indeed quite limited. Moreover, it is unclear why sharing the encoding layer can actually help -- especially since it encodes a full paragraph for SQuAD while it only encodes a single sentence for WMT. Would a pre-trained language model help in the same way (see Peters et al, 2018, Deep contextualized word representations)? For the summarization result (Table 4), the absolute improvement is also small. Can you give a similar ablation analysis to show how much of the gains come from sharing the encoding layer, the modeling layer and the focal loss?

Some other comments:
- The citation format needs to be fixed; for example, on line 15 there is no space before [Seo et al, 2017].
- Cui 2017a and Cui 2017b are the same paper.
- 21: I think SQuAD cannot be called “one of the toughest machine comprehension tests”. It is better to call it “the most commonly used MC test” or “one of the most important MC tests so far”.
- 109: in the follows -> as follows
- 199-200: the presented results on SQuAD are already considerably lower than the state of the art (see the SQuAD leaderboard).
- In the introduction, there are a few places which cite the MC literature, but the citations are somewhat arbitrary:
  1) Line 15: [Seo et al., 2017; Shen et al., 2017a]
  2) Line 19: end-to-end neural networks [Wang et al., 2017; Xiong et al., 2017a; Cui et al., 2017a]
  3) Line 34: many methods[Seo et al., 2017; Wang et al., 2016; Xiong et al., 2018].
- Additionally, the paper describes the proposed approach as a “machine comprehension augmented encoder-decoder supplementary architecture”. I am not sure what “supplementary architecture” means here, but it is more like doing transfer learning from an MC task to a language generation (seq2seq) task.

=========
I have carefully read all other reviews and the authors' rebuttal. My score remains the same.

Reviewer 3

The paper proposes a transfer learning approach that re-uses both the encoder and the modeling layer from a pre-trained machine comprehension model in sequence-to-sequence models. The augmented sequence-to-sequence model shows nice gains on both NMT and summarization. The experiments also include some ablations indicating that both the encoder and the modeling layer provide some gains. The experiments also show that just adding randomly initialized (not pretrained) parameters does not improve results, so the improvements are not just an artifact of increased model capacity. This work is part of a growing sequence of works showing the effectiveness of pretraining on one task and transferring to another (McCann et al, 2017; Peters et al, 2018). Prior work has looked at transferring MT and/or language modeling. This work differs somewhat both in a) what is transferred and b) the source task. I have not seen machine comprehension as a source task before.

Strengths: Shows the efficacy of transferring knowledge from machine comprehension to two different tasks. Includes ablation experiments showing the effect of subcomponents and random initialization vs. pretraining. Figure 1 is very helpful for understanding the proposed model.

Weaknesses: There appear to be 3 components of "MacNet": 1. the Encoding Layer, 2. the Modeling Layer, 3. using the focal loss (Lin et al, 2017). My impression is that +/- the focal loss is the difference between the "Baseline + Encoding Layer + Modeling Layer" row and the "Baseline + MacNet" row. This row actually shows a larger difference than adding either of the pretrained components, which is somewhat disappointing. Are ~50% of the gains due to adjusting the objective? How well does this do on its own (without the pretraining parts)? It would be nice to clarify the contributions of both: a) using this task versus previously proposed tasks for pretraining and b) using this approach vs. other pretraining transfer learning approaches. 
Given that both changed simultaneously, it's not clear how much to credit the task vs. the model. It could be that the task is so good that other existing transfer learning approaches are even better, or that the method is so good it would help with other source tasks even more, or both or neither.

Thanks for the response. It would be good to include the comparisons to CoVe and the summarization ablations in a final or subsequent version.
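
For readers unfamiliar with the objective that all three reviews single out, here is a minimal sketch of the standard focal loss of Lin et al. (2017) applied to the per-token probabilities of a target sequence. It is not the paper's Eq. 15, which the reviews describe only as focal-loss inspired and which may differ in detail. The function name, the value gamma = 2.0, and the example probabilities are illustrative assumptions.

```python
import math


def focal_loss(token_probs, gamma=2.0):
    """Mean focal loss over the probabilities assigned to gold target tokens.

    token_probs: probabilities in (0, 1] that a model gave to each gold token.
    gamma: focusing parameter (assumed value); gamma = 0 recovers the usual
           cross-entropy, larger values down-weight well-predicted tokens.
    """
    losses = [-((1.0 - p) ** gamma) * math.log(p) for p in token_probs]
    return sum(losses) / len(losses)


# Hypothetical probabilities for a four-token target sequence.
probs = [0.9, 0.6, 0.3, 0.95]
cross_entropy = focal_loss(probs, gamma=0.0)  # plain negative log-likelihood
focal = focal_loss(probs, gamma=2.0)          # confident tokens contribute less
```

Because the factor (1 - p)^gamma is at most 1, the focal value never exceeds the cross-entropy on the same probabilities, and the relative weight of poorly predicted tokens grows with gamma. An ablation of the kind the reviewers request would train the baseline with only this change to the objective and no transferred layers.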