NeurIPS 2019
Sun Dec 8 – Sat Dec 14, 2019, Vancouver Convention Center
Paper ID: 462
Title: RUBi: Reducing Unimodal Biases for Visual Question Answering

Reviewer 1

Originality: The proposed method is a novel dynamic loss re-weighting technique applied to VQA under the changing-priors condition (VQA-CP), where the train and test sets are deliberately constructed to have different distributions. The related works are adequately cited and discussed. While prior works have also focused on using knowledge from a question-only model to capture unnecessary biases in the dataset [25], the paper differs from [25] in some key aspects. For example, the proposed model guides the whole model (including the visual encoding branch) to learn "harder" examples better, whereas [25] focuses only on reducing bias from the question encoding.

Quality: The proposed method is sound and well motivated. The experimental setup is mostly sound and on par with prior works. I have some qualms about the absence of some common-sense baselines, but that is not entirely the authors' fault, since prior works have also failed to set a precedent. Another minor issue with the paper is the lack of discussion of shortcomings and future work.

Clarity: The paper is very easy to read, and the major contributions and background work are clearly laid out. The method descriptions are complete and contain enough detail to reproduce the work.

Significance: The paper has moderate significance overall. The proposed method is sound and is clearly demonstrated to work well on the existing dataset. It is likely that the proposed method will be used in other bimodal tasks as well. However, in common with existing works in this space, the algorithm is designed with prior knowledge about how exactly the (artificially perturbed) VQA-CP was constructed. Therefore, to me, it is slightly less significant than (hypothetical) alternative algorithms that propose robust vision-language understanding, rather than specialized algorithms developed just for "reducing" the effects of bias on a specially constructed dataset.
It is still a valuable contribution, just a very specific one rather than a truly general solution.

*** POST REBUTTAL COMMENTS ***
As I said in my original review, I think the "good results" obtained on VQA-CP alone are not *very* significant, as carefully established baselines have not really existed for it. I thank the authors for their rebuttal, which already shows that even the most naive baselines perform over 4% better than the existing numbers. Again, I do not put the burden of this on the authors; this is something the whole community has somewhat ignored. That said, the improvement in robustness from the proposed model is undeniable. In the rebuttal the authors also show that it improves robustness on VQA-HAT. Most importantly, the proposed method is a model- and task-agnostic de-biasing technique that I think will be useful to present to the large community at NeurIPS. After reading the other reviews and the author rebuttal, I am raising my score to 7.

Reviewer 2

The paper proposes a strategy that guides the model logits toward more ‘question-biased’ logits. The effect is that when the (biased) prediction matches the label, the loss is smaller; when it does not, the loss is larger. The model achieves this by merging the normal VQA model logits with logits generated from the question only. During testing, the path on the question-only side is removed, so that only the VQA logits are used for prediction. The experiments show the method outperforms existing methods on the VQA-CP data. The paper is well written and easy to follow.

The role of the c'_q classifier is a bit unclear. Since the outputs of c_q are already logits, why is it necessary to feed the c_q logits into another classifier? What if the logits of c'_q and c_q do not match each other? Is there any way to enforce consistency between the outputs of c'_q and c_q?

The way of combining the question-only model is not unique. Why choose this particular formulation? Have you tried any other candidates?

The method seems to perform well on the VQA-CP data. However, its use case may be very limited. It is only applicable to VQA settings where question biases are present, and it cannot be adapted to other applications. It also does not address other robustness issues of VQA, for example cycle consistency. Therefore, although the idea seems novel and valid for this application, its impact may be low.

*** After reading the author feedback ***
I appreciate the additional experiments and the clarifications regarding the role of c'_q and the ablation over ways of combining the question-only model. These clarifications and the additional results add strength to the paper.
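For concreteness, the merging scheme described above can be sketched in a few lines. This is my own illustrative reconstruction, not the authors' code; the function names and example numbers are mine, and I assume a sigmoid mask over the question-only logits.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse_logits(vqa_logits, q_only_logits, training=True):
    """Merge the base VQA logits with a mask computed from the
    question-only logits. At test time the question-only path is
    removed, so the raw VQA logits are returned unchanged."""
    if not training:
        return list(vqa_logits)
    return [z * sigmoid(q) for z, q in zip(vqa_logits, q_only_logits)]

# Answers strongly favored by the question-only branch keep their logit
# almost intact, while disfavored answers are suppressed, so examples
# answerable from the question alone become "easy" and yield less loss.
train_out = fuse_logits([2.0, 1.0], [4.0, -4.0], training=True)
test_out = fuse_logits([2.0, 1.0], [4.0, -4.0], training=False)
```

Under this sketch, `test_out` equals the base VQA logits, while in `train_out` the first (bias-favored) logit dominates even more strongly.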

Reviewer 3

Summary
- The authors address the task of VQA with a specific interest in minimizing unimodal bias (specifically, language bias). To that end they propose a general approach that can be combined with different VQA architectures. The main idea is to use a question-only VQA model (no image input), and to reduce/increase the loss of the data points that are correctly/incorrectly answered by the question-only model.
- The experiments are carried out on the VQA-CP v2 dataset. A direct comparison with the prior state of the art [25] (with the same underlying VQA architecture) shows superior performance of the proposed method. The authors also show that their method leads to only a small drop in performance on the standard VQA v2 dataset.
- The qualitative results in Fig 4 are interesting and informative.

Originality
- The proposed approach is overall novel, to the best of my knowledge. Nevertheless, it resembles the method of [25], which, while mentioned, could be compared to the proposed approach more thoroughly. E.g., [25] also contains a question-only branch to “unbias” the VQA models. It would help if the authors illustrated the two approaches side by side in Fig 2 and provided a detailed comparison.
- There are other recent works exposing unimodal bias in tasks such as embodied QA [A,B], vision-and-language navigation [B], and image captioning [C].

[A] Ankesh Anand, Eugene Belilovsky, Kyle Kastner, Hugo Larochelle, and Aaron Courville. Blindfold baselines for embodied QA. NeurIPS 2018 Workshop on Visually Grounded Interaction and Language (ViGIL).
[B] Jesse Thomason, Daniel Gordon, and Yonatan Bisk. Shifting the baseline: Single modality performance on visual navigation & QA. NAACL 2019.
[C] Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. EMNLP 2018.
Quality
- It is not entirely clear how the classifier c_q is trained; L150 mentions a cross-entropy loss, but it does not seem to be visualized in the approach overview in Fig 2. (I assume that the classifier c_q represents the question-only VQA model?)
- Later, another loss, L_QO, is introduced in L184 for the classifier c_q’; it is supposed to “further improve the unimodal branch ability to capture biases”. What is the connection of this loss to the one mentioned above? What is the role of the classifier c_q’?
- The experiments are only carried out on VQA-CP v2, not on VQA-CP v1, as done in [25].
- There is no discussion of why the proposed baseline benefits from the proposed approach much more than, e.g., the UpDn model (Table 1 vs. Table 2).
- It would be informative to include the detailed evaluation breakdown (by answer type) for the experiments in Table 2 (similar to Table 1), for a more complete “apples-to-apples” comparison to [25].

Clarity
- The authors title the paper/approach “Reducing Unimodal Biases”, while they specifically focus on language biases (question/answer statistics). Some prior work has analyzed unimodal biases considering both language-only and vision-only baselines, e.g. [B]. It seems more appropriate to say “language biases” here.
- It is not clear from Fig 2 where backpropagation is or is not happening (it becomes clearer from Fig 3).
- L215 makes a reference to GVQA [10] performance, but it is not included in Table 1.
- The prior approaches in Table 1 are not introduced/discussed.
- Writing: L61: from => by; L73: about => the; L196: merge => merges.

Significance
- There is a great need to address bias in deep learning models, in particular in the vision-and-language domain, making this work highly relevant. While the proposed method does not appear groundbreakingly novel, it is nevertheless quite effective. The main issues of the submission are listed above (e.g., comparison to [25], confusing parts in the approach description, somewhat limited evaluation).

Other
- The authors have checked "Yes" for all questions in the Reproducibility checklist, although not all are even relevant.

UPDATE
I appreciate the new baselines/ablations that the authors provided upon R2’s and R3’s requests. Besides, they included convincing results on VQA-CP v1 and promised to address some of my other concerns in the final version. I thus keep my original score of 7.
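The confusion around c_q, c_q’, and L_QO raised above can be made concrete with a small sketch of a two-term training loss. This is my own reconstruction under the assumption that both the main loss and L_QO are standard cross-entropy terms; the function names are illustrative, and the comment on gradient flow reflects my reading of the paper, not code the authors released.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def total_loss(fused_logits, q_only_logits, target):
    """Sum of the main loss, computed on the fused (mask-modulated)
    logits, and the question-only loss L_QO, computed on the logits of
    the question-only classifier. As I understand the paper, L_QO is
    backpropagated only through the question-only branch, which is what
    lets that branch specialize in capturing question biases."""
    main = cross_entropy(fused_logits, target)
    l_qo = cross_entropy(q_only_logits, target)
    return main + l_qo

# Uniform logits over two answers give the expected log(2) per term.
loss = total_loss([0.0, 0.0], [0.0, 0.0], target=0)
```

In this reading, c_q’s outputs feed both the mask and (via a further classifier) the L_QO term, which would answer part of the question about its role, but the paper should state this explicitly.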