NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID:521
Title:Dialog-based Language Learning

Reviewer 1

Summary

The paper describes a number of alternative ways to accomplish fact learning through a dialog, where feedback is provided in different forms and at different levels of completeness. The paper then describes several modifications to a baseline Memory Network architecture that are claimed to enable the proposed style of learning.

Qualitative Assessment

The paper gives a very nice exposition of a good variety of ways in which one may learn through a dialog. These learning settings are believed to better approximate how learning happens in nature. However, before diving into the detailed solutions, there is only superficial mention of how a basic QA dataset is modified to construct the type of feedback-carrying dialog used in the experiments, and there is no justification of how well such data mimic actual learning dialogs in nature. A promise to publish the dataset after publication of this paper does not solve this problem, because the motivation and validity of the claimed solutions, the merits of the claimed accuracies, etc., all critically depend on what is really included in this so-far-hidden dialog dataset. For example, in the Forward Prediction method, it is not clear whether the second input from the learner (the response to the answer) is part of the training data, how it is related to the learner's first input and to the answer, and how it is generated during a test session. Hence it is not clear what exactly is being learned from it and encoded into the memory network, and why the strategy performs better in the attempted settings. Without any sense of what the dialog data look like, there is no basis on which to judge the quality of the solutions and the meaning of all the tricks tried, since they depend on the quality of the attempted answers, the levels of feedback, any side knowledge that is involved, and how that is connected to the questions and answers. Task 6 could be renamed "Partial Feedback" rather than "Missing Feedback".

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 2

Summary

The paper addresses a set of question answering tasks in a dialog setting in which a teacher provides feedback. The authors provide multiple setups with different types of feedback that can be available from a teacher. The goal is to go beyond a standard supervised learning setting and provide a feedback-driven setup. They adapt a forward prediction model to the dialog domain: this model tries to predict the next utterance given the current utterance and the current response, i.e., to predict the feedback that might be received from the teacher.
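To make the forward-prediction idea concrete, here is a minimal, hypothetical sketch of the scoring step it implies: each candidate answer is scored by how well the (context, answer) pair predicts the teacher's observed feedback. The embedding tables and the bag-of-words encoder are placeholder assumptions, not the authors' actual memory-network components.

```python
# A toy sketch of forward prediction: score each candidate answer by how well
# (context, answer) predicts the teacher's observed feedback. Embeddings here
# are random placeholders; in the paper they come from a trained memory network.
import numpy as np

rng = np.random.default_rng(0)
dim, vocab = 32, 100

# Hypothetical embedding tables for context, answers, and feedback.
E_ctx = rng.normal(size=(vocab, dim))
E_ans = rng.normal(size=(vocab, dim))
E_fb = rng.normal(size=(vocab, dim))

def embed(table, word_ids):
    """Bag-of-words embedding: sum of the word vectors."""
    return table[word_ids].sum(axis=0)

def forward_prediction_scores(context_ids, candidate_answer_ids, feedback_ids):
    """Score candidate answers by how well they predict the teacher feedback."""
    u = embed(E_ctx, context_ids)      # dialog context representation
    f = embed(E_fb, feedback_ids)      # observed teacher feedback
    scores = []
    for ans_ids in candidate_answer_ids:
        h = u + embed(E_ans, ans_ids)  # combine context with a hypothesized answer
        scores.append(h @ f)           # dot-product match against the feedback
    return np.array(scores)

# The answer whose combination best predicts the observed feedback is preferred,
# which is why no explicit numeric reward signal is needed.
```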

Qualitative Assessment

Strengths:
- The task is interesting; the introduced sub-tasks are useful.
- The extension of forward prediction to memory networks is interesting.
- The paper shows that the proposed model works for two datasets, bAbI and movie QA.

Weaknesses:
- The claims made in the introduction are far from what has been achieved by the tasks and the models. The authors call this task language learning, but evaluate on question answering. I recommend the authors tone down the introduction and not call this language learning; it is rather feedback-driven QA in the form of a dialog.
- With a fixed policy, this setting is a subset of reinforcement learning. Can the tasks be made more complicated (as outlined in the last paragraph of the paper) so that the policy is not fixed? Then the authors could compare with a reinforcement learning baseline.
- The details of the forward prediction model are not well explained. In particular, Figure 2(b) does not really show a schematic representation of the forward prediction model; the figure should be redrawn. It was hard to connect the pieces of the text with the figure as well as with the equations.
- Overall, the writing quality of the paper should be improved; e.g., the authors spend the same amount of space explaining basic memory networks as the forward model. The related work is missing pieces on further reinforcement learning tasks in the literature.
- The 10 sub-tasks are rather simplistic for bAbI: the authors could solve all of them with their final model. More discussion is required here.
- Error analysis on the movie dataset is missing. In order for other researchers to continue on this task, they need to know the cases in which such a model fails.

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 3

Summary

This paper presents a methodology for learning to solve question answering tasks and argues that this methodology does not rely completely on supervised learning. The main argument is that the machine learns the correct way to answer through language-based interaction and some kind of reward signal. The contributions are three-fold: i) a set of new tasks with different complexities is defined, ii) a new architecture based on Memory Networks is proposed, and iii) a prediction model is built and compared to a baseline (pure imitation). The main result is that the prediction-model architecture seems suitable for learning without being fed any kind of reward signal. The main reason is that predicting that the next dialogue turn will confirm the machine's answer can only be done if the machine did a good job of answering.

Qualitative Assessment

I think this paper nicely follows the track of work previously done with memory networks on dialogue tasks. It defines a methodology that is sound and provides a set of interesting tasks. The experimental part reports a lot of work too. Nevertheless, in addition to my previous comment on the match between the claims (learning language through dialogue) and the content (generating one single word and a request for help), I have several other concerns. To me, it is not really a problem to address simplified tasks if there is hope that the proposed methodology and algorithms will scale up to more complex tasks later. But here, I hardly see how a similar methodology could handle real dialogues and complex language learning. How would the authors enhance their model to generate real, complex queries to the user (not predefined ones), and how would they handle multiple dialogue turns with complex language? The forward prediction architecture also seems interesting, but it can hardly be extended to multiple dialogue turns. I think this is why reinforcement learning is interesting in dialogue: evaluating a value function implicitly learns a transition model and predicts future outcomes. Here, the authors say that this architecture allows avoiding the definition of a reward thanks to its ability to predict words such as "right" or "correct", but users do not always give feedback of this kind in real-world dialogues, and this is, in the end, also a kind of reward signal. There has also been a lot of work on reinforcement and imitation learning for dialogue management, as well as work on dialogue failure prediction and even on language acquisition, that is not mentioned at all in the paper. I think it is worth referring to these works and explaining how this one differs and leads to novel models.

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 4

Summary

The authors of the paper aim to tackle the task of question answering in a dialog setting. Their main contributions are: (1) introducing 10 subtasks of dialog QA with different settings (e.g. the teacher asking for a correction vs. the teacher supplying the answer), (2) taking advantage of feedback from the other speaker, which is very useful but has not been much explored before, (3) learning the model in an end-to-end manner via neural networks, (4) providing several strong baselines for the task, and (5) proposing a forward prediction model, which tries to predict the utterance of the teacher in the loop and gives better results than the baselines. The authors transform a synthetic QA dataset (bAbI) and the MovieQA dataset into dialog-based datasets.

Qualitative Assessment

Strengths:
- It opens a new direction for dialog-based question answering. The paper gives a guideline for how natural feedback from the teacher or expert speakers can be used to answer questions more effectively.
- Traditionally the QA task was mainly supervised by the answers, but different forms of supervision exist, and sometimes only weaker supervision is available (e.g. Task 4, Hints). The model attempts to take full advantage of them.
- It was easy to follow and understand the paper.
- The forward prediction model is simple and makes sense.

Weaknesses / Major concerns:
- It is difficult to evaluate whether the MovieQA result should be considered significant, given that a 10% gap exists between MemN2N on the dataset with explicit answers (Task 1) and RBI + FP on the datasets with other forms of supervision, especially Task 3. If I understood correctly, the different tasks come from the same data, but the authors provide different forms of supervision, and Task 3 gives full supervision of the answers. Then I wonder why RBI + FP on Task 3 (69%) does much worse than MemN2N on Task 1 (80%). Is it because the supervision is presented in a more implicit way ("No, the answer is kitchen" instead of "kitchen")?
- For RBI, the authors only train on rewarded actions. This means that unrewarded actions that carry useful supervision (such as "No, the answer is Timothy Dalton." in Task 3) are ignored as well (see the sketch after this list). I think this could be one significant factor that makes FP + RBI better than RBI alone. If not, I think the authors should provide a stronger baseline than RBI (one that is supervised by such feedback) to prove the usefulness of FP.

Questions / Minor concerns:
- For bAbI, it seems the model was only tested on the single-supporting-fact dataset (Task 1 of bAbI). How about the other tasks?
- How is the dialog dataset obtained from the QA datasets? Are you using a few simple rules?
- Lack of lexical / syntactic diversity of teacher feedback: assuming the teacher feedback was auto-generated, do you intend to turk the teacher feedback and / or generate a few different kinds of feedback (which is closer to a real-life situation)?
- How do models other than MemN2N do on MovieQA?
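As a concrete illustration of the concern that RBI discards supervision, here is a minimal hypothetical sketch of the filtering step implied by reward-based imitation as described in this review; the example turns and field names are invented for illustration and are not taken from the paper's data.

```python
# Minimal sketch: reward-based imitation (RBI) keeps only rewarded turns as
# supervised (context, answer) pairs, so textual feedback attached to
# unrewarded turns is dropped. Example data is hypothetical.

dialog_log = [
    {"context": "Where is Mary?", "answer": "kitchen", "reward": 1,
     "feedback": "Yes, that's right."},
    {"context": "Who played Bond in 1987?", "answer": "Roger Moore", "reward": 0,
     "feedback": "No, the answer is Timothy Dalton."},
]

# RBI training set: only rewarded turns survive; the useful feedback
# in the second turn is ignored entirely.
rbi_training_pairs = [
    (turn["context"], turn["answer"])
    for turn in dialog_log
    if turn["reward"] == 1
]

# A stronger baseline, as suggested above, would also extract supervision from
# the feedback text itself (e.g., parse "the answer is X" into a label).
print(rbi_training_pairs)
```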

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 5

Summary

This paper proposes to learn question answering models using feedback in the form of short textual exchanges. This can be seen as learning from a very primitive form of dialog between a teacher (who provides the questions and some feedback) and another student (who provides answers that are correct with some fixed probability). The paper proposes alterations to two existing datasets: the artificial bAbI dataset, and a movie QA dataset, in order to incorporate 10 different types of supervision from the teacher and student. In some settings, an explicit reward is given when the student answers correctly. An augmentation to the memory network architecture is proposed that achieves strong performance on the bAbI dataset, and moderate performance on the real-world MovieQA dataset.

Qualitative Assessment

The proposed research direction of exploring models that learn from different types of conversational feedback is interesting and, to my knowledge, novel. The presentation of the task is well motivated and could lead to some interesting future research directions. The paper is fairly well written.

When proposing a new task, it is often difficult to determine how challenging to make it. By far the most interesting task proposed is Task 7, where no reward-based feedback is given. However, it would be very nice to have more variations of this task. For example, one could have a varying number of script templates for correct and incorrect examples, vary the length of the responses, or have more realistic 'noisier' responses (i.e., where the feedback is hidden in a larger response). This would isolate the language component of the feedback for learning QA policies. It is important to determine what kind of language feedback can be learned from, and this paper only explores one such aspect (whether additional information is included from the teacher in the response), and only in the setting with reward-based feedback. In my view such tasks would be more useful than some of the other proposed tasks (for example Tasks 9 and 10, detailed below). I am slightly concerned that the remainder of the proposed tasks, those with reward-based feedback, are too easy, and that the variations between tasks are too small. However, in some ways this is good for a new task, as one can isolate the cases where a proposed model under-performs (which is hard to do with complex tasks). It is impossible to know with certainty whether the tasks are too easy until more diverse models (other than just memory networks) are tried.

A crucial point of the paper is that the learner (i.e., the memory network model being learned) is not the one actually interacting with the teacher. Instead, the model simply observes the conversation between the teacher and some other student, and learns to answer the questions from varying forms of feedback. It should be made more explicit in the abstract and introduction that the model being learned in this paper is not a dialog agent but a question answering system. This becomes evident much later in the paper; for clarity it should be addressed earlier. The formulation of Tasks 9 and 10, where the student asks for help, is also unclear. In the description, it is stated that 'the learner must ask for help in the dialog'. This seems to imply that the learner must *learn* to ask for the help of the teacher. However, the learned model is not interacting with the teacher at all; it is simply observing the other student answering the question. Thus, this is indeed very similar to Tasks 3 and 5 (which is reinforced by the similarity of the results in Tables 1 and 2). This can be inferred from later sections of the paper, but is not clear enough at the outset.

The proposed memory network extension incorporating forward prediction is interesting, and its success is well justified in Section 5. The experiments are well conducted; in particular, the results on the MovieQA dataset are crucial to demonstrate that this supervision approach can apply to real-world tasks, and they leave room for improvement for future models. The fact that all of the models and baselines are memory network variants is a slight negative. Also, it would be good to have some kind of error estimates: the models that should achieve random performance (the FP model on Task 1, the RBI model on Task 7) exhibit significant variance on the bAbI dataset (fluctuating from 20-29%), which implies that there could be some variance in the remainder of the results. This appears to be less of a problem on the MovieQA dataset.

Overall, this is a fairly good paper that makes an interesting contribution in terms of new tasks and models. There is some concern that the tasks are not difficult enough, as well as some issues with clarity in the paper (detailed above), which should be addressed by the authors. However, I could see the proposed tasks becoming a bAbI-style toy benchmark used for model justification, so that these models can later be applied to more difficult versions of the tasks.

Small comments: Line 57: put a period after [14] in related work.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 6

Summary

This paper presents a set of natural language feedback types for learning from a teacher in a dialog setup, and a set of models, including imitation learning and predictive lookahead, based on memory networks used to represent the dialog context (as presented in previous work). The models are compared for different feedback types on two datasets: the bAbI dataset and a movie question answering dataset.

Qualitative Assessment

The set of feedback types (i.e., tasks) presented in the paper is extensive; however, the data is artificially generated. Humans learn language from mixed feedback, where the feedback type is not known. Hence, I think the paper would be much more interesting if a mixed setup were created (even if artificially) and included in the experimental framework, where the feedback type is not specified as input. Furthermore, imitating an expert student is the same as learning from human-human dialogs, and there are previous studies on this type of interaction for both goal-oriented systems and chit-chat conversations. It would be useful to include a review and comparisons.

It would also help to see further analysis of how each approach differed. For example, the reward-based imitation model is trained with a subset of the data of the imitation-based method. So why do the performances differ that much? Are there any other differences? How is the set of action candidates determined for the forward prediction model? Is the set of candidates fixed and finite? If so, how would this apply to question answering? Is this approach scalable to open domains? It would help to see some discussion. The natural language feedback could also be converted to a reward (i.e., the reward shaping work of Su & Vandyke et al., SigDial 2015); could such rewards be useful here? (A toy illustration of this idea is sketched below.)

Figure 1 uses too much space: the dialog context could be presented just once instead of for each task, and the freed space could be used for analysis. What is "A:"? The previous turns do not have turn labels. How is the class of the answer determined when fabricating data for Task 4? For example, why is it "no, the director is male" and not "no, it is a director", which would make more sense?
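Purely as an illustration of the reward-shaping suggestion above, here is a toy sketch that maps free-form teacher feedback onto a scalar reward with a hand-coded keyword heuristic. Su & Vandyke et al. learn such a mapping rather than hard-coding it, so the cue lists and function below are hypothetical.

```python
import re

# Toy mapping from free-form teacher feedback to a scalar reward.
# Crude keyword heuristic for illustration only; a learned model
# (as in the reward-shaping literature) would replace this.
NEGATIVE_CUES = {"no", "wrong", "incorrect", "sorry"}
POSITIVE_CUES = {"yes", "right", "correct", "yup"}

def feedback_to_reward(feedback: str) -> float:
    tokens = set(re.findall(r"[a-z']+", feedback.lower()))
    if tokens & NEGATIVE_CUES:   # check negatives first: "No, that's not right"
        return -1.0
    if tokens & POSITIVE_CUES:
        return 1.0
    return 0.0                   # neutral or uninformative feedback

print(feedback_to_reward("Yes, that's correct!"))        # 1.0
print(feedback_to_reward("No, the answer is kitchen."))  # -1.0
```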

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)