NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID:188
Title:Hierarchical Question-Image Co-Attention for Visual Question Answering

Reviewer 1

Summary

This paper presents a new method for VQA. As its main novelty, the paper proposes a co-attention mechanism that jointly reasons about visual (image) and textual (question) attention. Furthermore, the proposed method uses a hierarchical model with co-attention maps at the word, phrase, and question levels. Results indicate a slight improvement with respect to alternative state-of-the-art approaches.

Qualitative Assessment

The paper presents an incremental contribution with respect to previous methods for VQA, which only exploit an image attention mechanism guided by question data. Here, the authors also consider a question attention mechanism guided by image information. In this sense, the main hypothesis of this work is that jointly considering visual and question attention mechanisms can improve the performance of current VQA systems. I agree that this hypothesis can be relevant for the case of long questions, but I believe there is also a risk that question attention guided by image information can be misleading, in the sense that an image usually includes several information sources while the question is more focused. In Figure 3, the authors include a graph showing the impact of question length on performance; while this figure seems to show a tendency, the effect is still weak, and a numerical analysis could help support this point. I believe an analysis of the differences (not only question length) between the most common errors of previous works (image attention only) and the proposed approach (image and question attention) would help to support the relevance of the proposed attention mechanism.

In terms of numerical results, in the abstract and in Section 4.3, the authors claim an increase in performance with respect to alternative state-of-the-art approaches from 60.4% to 62.1% on the VQA dataset and from 61.6% to 65.4% on COCO-QA. However, these numbers are misleading because they are obtained with different visual features: the alternative methods use VGG features, while the proposed method uses deep residual features, which provide superior performance. With the same VGG features, the proposed method shows a smaller increase, from 60.3% to 60.5% on VQA and from 61.6% to 63.3% on COCO-QA. While the authors also include results of their method with VGG features, it is important to clarify this point, for example in the abstract.

The authors claim novelty for the 1D convolutional CNN used for text processing, but I believe the proposed method includes only minor variations with respect to previous works, such as https://arxiv.org/pdf/1511.02274v2.pdf.

In some parts of the paper, the description of the method is very limited; I believe it would be difficult to reproduce the results of this paper based only on the information included. As an example, it is not clear to me how the method combines the hierarchical representations of the question embedding (word, phrase, question); is it the same approach used to encode the answers (Eq. 7)? Also, in Eq. 6, why do you use the same projection matrix for visual and question features? The use of Eq. 4 instead of Eq. 3 to obtain the attention maps needs further explanation; it would be good to include comparative results, as well as more intuition about the different terms in Eq. 4 (a sketch of my reading of Eq. 3-4 is included below).

In Figure 4, the color code for word attention is not clear. In this sense, the visualization in Figure 4 does not support the claim that "our model is capable of localizing the main part of the question". I recommend normalizing the values so that the most relevant parts of the question are highlighted in red.

Minor issues:
- Line 107: what does the "c" stand for?
- Line 88: "hierarchical".
- Line 124: c_{i,j} should be C.
- Eq. 5: i = 1..N instead of n = 1..N.
- Line 142: "equal to 1".
- Line 150: missing a "to".
- Line 136: "Breifly".
- Line 237: "we can find some have".
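To make my question about Eq. 3 and Eq. 4 concrete, here is a minimal numpy sketch of how I understand the parallel co-attention: the affinity matrix (Eq. 3) and the attention maps (Eq. 4). The dimensions, weight names (W_b, W_v, W_q, w_hv, w_hq), and the exact placement of tanh and softmax are my assumptions from the paper's description, not the authors' code.

```python
# Minimal sketch of parallel co-attention as I read Eq. 3-4 (shapes assumed).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d, k = 512, 256          # feature dim, attention hidden dim (assumed)
N, T = 196, 12           # number of image regions, question tokens (assumed)
rng = np.random.default_rng(0)

V = rng.standard_normal((d, N))   # image features, one column per region
Q = rng.standard_normal((d, T))   # question features, one column per word

W_b = rng.standard_normal((d, d))
W_v = rng.standard_normal((k, d))
W_q = rng.standard_normal((k, d))
w_hv = rng.standard_normal(k)
w_hq = rng.standard_normal(k)

C = np.tanh(Q.T @ W_b @ V)                # Eq. 3: T x N affinity matrix
H_v = np.tanh(W_v @ V + (W_q @ Q) @ C)    # Eq. 4: k x N image attention map
H_q = np.tanh(W_q @ Q + (W_v @ V) @ C.T)  # Eq. 4: k x T question attention map
a_v = softmax(w_hv @ H_v)                 # attention over the N regions
a_q = softmax(w_hq @ H_q)                 # attention over the T words

v_hat = V @ a_v                           # attended image feature, shape (d,)
q_hat = Q @ a_q                           # attended question feature, shape (d,)
```

If this reading is correct, it would help to explain what the cross terms (W_q Q)C and (W_v V)C^T add over using Eq. 3 alone.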

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 2

Summary

This paper presents a novel co-attention model that attends to both image regions and question words jointly for visual question answering. In addition, the authors propose a hierarchical approach to attend to the image and question at three levels of question granularity. Results on two different datasets show that the proposed model outperforms the state of the art.

Qualitative Assessment

Overall, I like the main idea of the paper. However, I have a few concerns:
- The paper is full of typos and grammatical mistakes. I would suggest that the authors carefully proofread the paper.
- Line 75: wrong author name for the reference mentioned.
- Line 121: please clarify the similarities and differences between your parallel co-attention model and that of [21].
- The notation and equations in Section 3.4 are not consistent; for example, the concatenation operator is described in the text but missing from the equations (see the sketch after this list for how I currently read that section).
- The experimental section needs a lot of clarification: the VGG and Residual Net models are suddenly mentioned and compared in the table, with no description of the motivation or model details given beforehand.
- A brief review of the compared models is necessary. Also, did you run any statistical significance test to justify the differences between the models?
- A discussion and comparison of the parallel co-attention is missing from Figure 3; please explain why. Also, why does accuracy decrease as question length increases in general?
- Which co-attention is used in Figure 4? Please clarify.
- A detailed error analysis of the proposed models is necessary to improve the technical clarity of the paper. Why are the accuracies still in the 60-70% range? Some examples of incorrect answers would be useful, along with explanations of why they were incorrect. Also, some possible directions for improvement should be provided.
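For reference, here is a small numpy sketch of how I currently read the recursive answer encoding in Section 3.4, with the concatenation operator written out explicitly. The shapes, weight names, and answer-vocabulary size are my assumptions; please correct the equations in the paper if this reading is wrong.

```python
# Sketch of the hierarchical answer encoding as I read Section 3.4 (assumed shapes).
import numpy as np

d, n_ans = 512, 1000                      # feature dim, answer vocabulary (assumed)
rng = np.random.default_rng(0)

# attended question (q) and image (v) features at each level of the hierarchy
q_w, v_w = rng.standard_normal(d), rng.standard_normal(d)   # word level
q_p, v_p = rng.standard_normal(d), rng.standard_normal(d)   # phrase level
q_s, v_s = rng.standard_normal(d), rng.standard_normal(d)   # question level

W_w = rng.standard_normal((d, d))
W_p = rng.standard_normal((d, 2 * d))
W_s = rng.standard_normal((d, 2 * d))
W_h = rng.standard_normal((n_ans, d))

h_w = np.tanh(W_w @ (q_w + v_w))
h_p = np.tanh(W_p @ np.concatenate([q_p + v_p, h_w]))   # [ ; ] = concatenation
h_s = np.tanh(W_s @ np.concatenate([q_s + v_s, h_p]))
logits = W_h @ h_s                                       # softmax over answers
p = np.exp(logits - logits.max()); p /= p.sum()
```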

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 3

Summary

This paper presents a hierarchical co-attention mechanism for visual question answering. The model learns to attend to important phrases in the question and salient regions in the image so as to perform visual question answering more effectively.

Qualitative Assessment

The paper is well explained, and information flows naturally throughout. The hierarchical attention structure at the word, phrase, and question levels and the two co-attention mechanisms are the main contributions of this work. The described framework makes intuitive sense. The experiments are comprehensive and well thought out. Results on the VQA and COCO-QA datasets show that the proposed approach substantially outperforms other state-of-the-art systems, which is quite impressive. In particular, I found the ablation study useful, as it quantifies the importance of the individual components of the architecture. Some minor aspects:
- It would be great if the motivation for Equation 4 could be explained before its introduction (i.e., elaborate on lines 128-129).
- Although it is not directly related to the visual question answering task (and albeit the paper is only on arXiv), I found the following paper on co-attention to be interesting and perhaps spiritually related; I'd be happy to see it included in the related work: Attentive Pooling Networks, Cicero dos Santos, Ming Tan, Bing Xiang, Bowen Zhou, https://arxiv.org/abs/1602.03609
- Space permitting, I'd suggest that the authors comment on the performance of the parallel and alternating co-attention mechanisms, not just listing the numbers but speculating on why and under what conditions one would perform better than the other.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 4

Summary

The authors propose a novel method for jointly modeling visual and textual attention. The method also uses a novel hierarchical way of combining the attention from different granularity levels. The proposed method achieves better performance than all previous methods at the time on the two largest VQA datasets.

Qualitative Assessment

The paper is well written and presented. The method is properly evaluated and compared to the relevant state-of-the-art methods at the time. Even though the authors have done a fine job, the novelty is not that great. As the authors point out, previous works have already proposed visual and text attention models. The related works they cite, [21] and [22], are very similar; they also use n-grams and a hierarchy of attention maps. I suggest they include more text pointing out the differences with these two methods and with the other VQA methods that use attention (especially the ones they are comparing against), and remove some of the language-attention related work. Next, the results are not significant: for Open-Ended the result is within 0.2% of [20]. DMN+ is not evaluated on Multiple-Choice, but it can be expected to have similar performance. Thus, without error bars, a 0.2% increase in accuracy does not look like a significant improvement. Finally, even though this is a good paper, the potential impact is low, since only two weeks after the paper submission the VQA state of the art was pushed forward by a significant margin.

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 5

Summary

This paper presents a way to model "where to look" for both the image and text modalities using a "co-attention" model for VQA that reasons about image and question attention. Previous work only dealt with attention over images and visualized the image regions relevant to answering the question. The paper also represents the text input hierarchically using 1-D CNNs that generate word-level, phrase-level, and sentence-level representations, as in (Yin et al., ACL 2016). On the VQA and COCO-QA datasets the method provides good results compared to the previous state of the art (namely DMN+).
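To spell out my understanding of this hierarchical text encoding, here is a rough numpy sketch: 1-D convolutions with window sizes 1, 2, and 3 over the word embeddings, followed by a max over the three scales at each word position. Filter counts and padding choices are my assumptions, and the question level (an LSTM over the phrase features) is omitted.

```python
# Rough sketch of the word-to-phrase encoding as I understand it (assumed shapes).
import numpy as np

d, T = 512, 12                                  # embedding dim, question length (assumed)
rng = np.random.default_rng(0)
Q_word = rng.standard_normal((d, T))            # word-level embeddings

def conv1d_same(X, W):
    """1-D convolution over time, zero-padded so the output keeps length T."""
    d_out, d_in, s = W.shape
    pad = s - 1
    Xp = np.pad(X, ((0, 0), (pad // 2, pad - pad // 2)))
    return np.stack([np.tensordot(W, Xp[:, t:t + s], axes=([1, 2], [0, 1]))
                     for t in range(X.shape[1])], axis=1)   # d_out x T

W1 = rng.standard_normal((d, d, 1))             # unigram filters
W2 = rng.standard_normal((d, d, 2))             # bigram filters
W3 = rng.standard_normal((d, d, 3))             # trigram filters

scales = [np.tanh(conv1d_same(Q_word, W)) for W in (W1, W2, W3)]
Q_phrase = np.maximum.reduce(scales)            # max over scales per position: d x T
```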

Qualitative Assessment

The paper provides a nice way to visualize both text and image attention using the co-attention model. This has potential impact, notably for multilingual and multimodal neural machine translation. I noted some typos: lines 75-76: it is not Hermann et al. but Bahdanau et al.; lines 135 and 150 also have typos. I was not convinced by the results, notably on the VQA dataset (Table 1). It seems that the main difference between DMN+ (the previous state of the art) and the best method presented in the paper is mostly explained by the use of residual features instead of VGG features; without ResNet features, the results do not seem significantly better than DMN+. While being able to visualize both text and image attention is nice, claiming that the method outperforms all previous results is doubtful ("our approach outperforms all previous results, improving the state of the art from 60.5% to 62.1%", line 183). What are the comparable results of Ours^p and Ours^a with VGG instead of residual features? Table 3 provides a nice ablation study of the model, even though it would have been nice to include the DMN+ baseline; it would also be more thorough to state in the caption that the results are on the VQA val set. At lines 184 and 185, the authors mention improvements of 0.9%, 1%, and then 0.6% and 2.0% for the Other and Num question types; I cannot match these to the results in Table 1. Is there a typo, or am I missing something? I have marked "potential impact or usefulness" as sub-standard because, while the co-attention is nice for visualizing both text and image attention, I think the results are not very convincing, and there are already new results (concurrent work; see the VQA challenge) that significantly outperform all the methods presented in this paper.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 6

Summary

This paper presents a co-attention model which jointly reasons about the image and question attention for the task of Visual Question Answering. Both image and question attention maps are estimated by calculating the correlations between each image region and each question word. To model the question feature, the authors utilize a hierarchical architecture with 1-D convolutional features and LSTM features. The experimental results show superior performance compared with the other methods. The visualization of the attention map confirms the ability of the proposed model to focus its attention on different regions of the image as well as different segments of the question.

Qualitative Assessment

The overall method is sensible and the experimental results are good. I have a few concerns:
1. I am curious what would happen if the authors ran the alternating co-attention multiple times, i.e., the final attended question feature is used again as guidance to attend to the image (a sketch of what I mean is below). Does it improve the performance?
2. It would be nice if a question-attention-only model were provided in Section 4.4.
3. In line 148, it is inappropriate to say "we take the top 1000 frequent answers", since there are only 430 answers in the COCO-QA dataset.
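To clarify point 1, here is a toy numpy sketch of the alternating co-attention as I read Eq. 6, plus the extra round I am asking about (attending to the image again, guided by the final attended question feature). The shapes, weight names, and the single shared set of weights across steps are my assumptions, not the authors' exact setup.

```python
# Toy sketch of alternating co-attention (Eq. 6 as I read it), with one extra round.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d, k = 512, 256                      # feature dim, attention hidden dim (assumed)
N, T = 196, 12                       # image regions, question tokens (assumed)
rng = np.random.default_rng(0)
V = rng.standard_normal((d, N))      # image features
Q = rng.standard_normal((d, T))      # question features

W_x = rng.standard_normal((k, d))
W_g = rng.standard_normal((k, d))
w_hx = rng.standard_normal(k)

def attend(X, g):
    """One attention step x_hat = A(X; g), guided by the vector g."""
    H = np.tanh(W_x @ X + np.outer(W_g @ g, np.ones(X.shape[1])))
    a = softmax(w_hx @ H)
    return X @ a

s      = attend(Q, np.zeros(d))      # step 1: summarize the question (g = 0)
v_hat  = attend(V, s)                # step 2: attend to the image, guided by s
q_hat  = attend(Q, v_hat)            # step 3: attend to the question, guided by v_hat
v_hat2 = attend(V, q_hat)            # extra round (my point 1): re-attend to the image
```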

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)