NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
This paper explores an interesting question: if we are allowed certain control over the input to a pre-trained language model, can we get it to return an arbitrary sentence? The control given is a vector z associated with the sentence, which is added as a bias to the hidden state at each timestep. In forward estimation, gradient descent is used to find the optimal z to "bias" the decoder towards a given sentence. In backward estimation, a given z is decoded to find the MAP sentence it encodes (which is intractable in general, so the authors use beam search). The authors analyze the "effective dimensionality" of a sentence space given a recoverability threshold tau; that is, what is the smallest dimension such that at most a tau-fraction of sentences fail to be encoded? Interestingly, larger models seem to require lower-dimensional inputs to recover sentences. The authors only check three model sizes, but there is a convincing trend here.

I want to like this paper; I'm interested in just how much can be memorized by big LMs, and I feel like this is getting at an important piece of the puzzle. But this feels like one fairly limited experiment, and I'm not sure there's enough here for a strong NeurIPS paper.

One technical quibble: I wonder if the results would be different if, rather than using a random matrix, smaller vectors were still bucketed into K buckets, the model attended over the buckets, and those vectors were projected up after attention. I don't think this is equivalent to the matrix-multiply version (having a softmax introduces a nonlinearity and fundamentally changes the model). There seems to be a phase transition consistently around 2x the model dimension, which suggests to me that somehow having 4 vectors to look at really gives the model more to work with.

The core experiment is good and gives some interesting results. Perhaps it makes sense that, in the limit, we should be able to provide a lower-dimensional indicator of a sentence and recover it, though I'm not sure what the limit of this process is, since I have no idea what the intrinsic dimensionality of language is. The major caveat in this work is whether the geometry of the z space has been meaningfully studied here.

There are several limitations:
(1) This assumes that all we care about is generating sentences in-domain. Gigaword is a much narrower corpus than, e.g., what BERT is trained on, and moreover much of the interest in pre-training is transfer to new settings. It seems disingenuous to pitch the model as encoding "arbitrary" sentences when they are in fact drawn from the training distribution.
(2) English is assumed to be a proxy for all languages.
(3) More minor, but it seems like the vast majority of such models have moved to some type of generation more closely conditioned on the input (attention of various forms), so I'm not sure how practically useful this insight is for model designers.
I think there's room in an 8-page paper to cover issues #1 and #2 in some depth. The authors don't necessarily need to go in this direction, but I want to see something more.

Overall, I like the direction a lot, but I feel like I can't strongly endorse this for NeurIPS until there's a stronger conclusion that can be drawn.

One presentation note: Section 3.2 could be written more efficiently. I was confused by the lack of detail in lines 106-130, but then this was largely recapitulated on the following page with additional details.

=====================

Thanks for the nice response.
I've raised my score to a 6 in light of the out-of-domain results provided; the comparison with random sentences is intriguing and tells us that what we can memorize are arbitrary *sentences* and not arbitrary strings. However, I'm still not quite sure how to contextualize these results based on what's in the paper.
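For concreteness, my mental model of the encoding ("forward estimation") step is roughly the following minimal PyTorch sketch. The toy LSTM, the random projection W, the use of Adam in place of the paper's optimizer, and all names are my own assumptions, not the authors' code.

# Minimal sketch of "forward estimation": fit a low-dimensional code z so that a
# frozen LM, biased by W z at every timestep, assigns high probability to one
# target sentence. Shapes, names, and the toy LSTM are illustrative only.
import torch
import torch.nn as nn

vocab_size, emb_dim, hid_dim, z_dim = 1000, 64, 128, 32

class BiasedLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)
        # Fixed random projection from the small code up to the hidden size.
        self.register_buffer("W", torch.randn(hid_dim, z_dim) / z_dim ** 0.5)

    def forward(self, tokens, z):
        h, _ = self.lstm(self.emb(tokens))      # (1, T, hid_dim)
        h = h + (self.W @ z).view(1, 1, -1)     # add the z-derived bias each step
        return self.out(h)                      # next-token logits

lm = BiasedLM()                                  # stand-in for the pretrained LM
for p in lm.parameters():
    p.requires_grad_(False)                      # the LM itself stays frozen

sentence = torch.randint(0, vocab_size, (1, 12)) # toy "target sentence"
inp, tgt = sentence[:, :-1], sentence[:, 1:]

z = torch.zeros(z_dim, requires_grad=True)       # the only free parameters
opt = torch.optim.Adam([z], lr=0.1)              # stand-in for the paper's optimizer
nll = nn.CrossEntropyLoss()

for _ in range(200):
    opt.zero_grad()
    loss = nll(lm(inp, z).reshape(-1, vocab_size), tgt.reshape(-1))
    loss.backward()
    opt.step()

The point of the sketch is just that z is the only set of free parameters; the LM stays frozen, so whatever is needed to reproduce the sentence has to be packed into those few dimensions.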
Reviewer 2
The paper investigates whether it is possible to convert a pre-trained language model into a generator that can be conditioned on a vector representation of a sentence. This conditioning is achieved by adding the vector z (in one form or another, depending on its dimensionality) to the hidden states of the LSTM LM. The sentence representation is found with nonlinear conjugate gradient, maximizing the probability of the given sentence with the components of z as the optimization parameters; in effect, the encoder of a sentence is the optimization process itself. For decoding, beam search is used. For a sufficiently large LM, almost perfect recoverability of sentences from the z space is possible.

The paper explores an important direction given recent developments in unconditional text generation. However, it is not entirely clear how to apply this approach to attention-based language models (e.g., GPT); adding such a discussion would clarify the limitations of the proposed method. The paper is well structured and contains all the necessary bits to understand the proposed method clearly.
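For readers less familiar with the decoding side, the backward step is essentially standard beam search over the z-biased LM. A minimal self-contained sketch (the uniform scorer at the bottom is a toy stand-in, not the authors' model):

# Minimal sketch of the decoding ("backward estimation") step: beam search over
# next-token log-probabilities produced by the frozen, z-biased LM. The scorer
# below is a toy stand-in, not the authors' model.
import math
import torch

def beam_search(log_probs_fn, bos, eos, beam_size=5, max_len=30):
    beams = [([bos], 0.0)]                         # (tokens, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            logp = log_probs_fn(tokens)            # (vocab,) next-token log-probs
            top_v, top_i = torch.topk(logp, beam_size)
            for v, i in zip(top_v.tolist(), top_i.tolist()):
                candidates.append((tokens + [i], score + v))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates:           # keep the best continuations,
            (finished if tokens[-1] == eos else beams).append((tokens, score))
            if len(beams) == beam_size:            # setting finished ones aside
                break
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])[0]

# Toy scorer: uniform distribution over a 10-token vocabulary (token 1 = EOS).
uniform = torch.full((10,), -math.log(10.0))
print(beam_search(lambda prefix: uniform, bos=0, eos=1))

In the paper, the scorer would be one step of the frozen LM with the z-derived bias added, so the only non-standard ingredient is where z enters the computation.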
Reviewer 3
The paper concerns itself with whether it is feasible to use a pretrained language model as a universal decoder. To this end, it proposes an approach to forcing a pretrained language model to output a particular target sentence, essentially by adding a bias vector to the pretrained LM's hidden state at each time step. The paper shows that for sufficiently large pretrained LMs it is possible to optimize with respect to this bias vector such that a held-out target sentence can generally be decoded (using beam search) with high accuracy.

The paper is easy to follow, and the idea of a universal decoder is compelling (and likely to be on the minds of many NLP people), so I think the results presented in this paper will have an impact. At the same time, the question of whether it will actually be practical to have a universal decoder remains unanswered: as the authors partially note, it is unclear whether a pretrained LM can generate text that is sufficiently different from that on which it was trained, it is unclear whether an encoder can mimic the optimization with respect to the z's (though it seems reasonable to be optimistic about this), and it is not clear whether the finite capacity of the z vectors will end up being a practical issue, especially for generating longer sentences.

The experiments are largely convincing. However, although it isn't completely clear, it sounds as though the beam search is carried out assuming the true length T, which may lead to overly optimistic numbers. Similarly, it would be good to establish whether larger beams ever decrease performance (as they often do), which might be another reason for caution.

Update after response: thanks for the out-of-domain results and the beam search experiments. These results are certainly encouraging, and I continue to recommend acceptance.
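For completeness, the protocol I had in mind for those two checks is roughly the following sketch; the decoder and test set here are toy placeholders, not the authors' pipeline.

# Sketch of the protocol: exact-match recovery with decoding stopped by the
# model's own EOS prediction (no gold length T), swept over beam sizes.
# `decode` and `test_set` below are toy placeholders, not the authors' code.

def recovery_rate(decode, test_set, beam_size):
    """Fraction of held-out sentences recovered exactly from their codes z."""
    hits = sum(decode(z, beam_size=beam_size) == gold for z, gold in test_set)
    return hits / len(test_set)

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs: the "decoder" succeeds only when the
    # beam is wide enough to contain the right continuation.
    test_set = [(i, [i]) for i in range(10)]        # (code z, gold tokens)

    def decode(z, beam_size):
        return [z] if z < beam_size else []

    for k in (1, 2, 5, 10, 20):
        print(k, recovery_rate(decode, test_set, beam_size=k))

The interesting quantity is simply how this recovery curve moves as the beam grows and when no length hint is available, which the response now partially addresses.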