{"title": "TAB-VCR: Tags and Attributes based VCR Baselines", "book": "Advances in Neural Information Processing Systems", "page_first": 15615, "page_last": 15628, "abstract": "Reasoning is an important ability that we learn from a very early age. Yet, reasoning is extremely hard for algorithms. Despite impressive recent progress on tasks that necessitate reasoning, such as visual question answering and visual dialog, models often exploit biases in datasets. To develop models with better reasoning abilities, the new visual commonsense reasoning (VCR) task has recently been introduced. Models not only have to answer questions, they also have to provide a reason for the given answer. The proposed baseline achieved compelling results, leveraging a meticulously designed model composed of LSTM modules and attention nets. Here we show that a much simpler model, obtained by ablating and pruning the existing intricate baseline, can perform better with half the number of trainable parameters. By associating visual features with attribute information and better text-to-image grounding, we obtain further improvements for our simpler yet effective baseline, TAB-VCR. We show that this approach results in a 5.3%, 4.4% and 6.5% absolute improvement over the previous state of the art on question answering, answer justification and holistic VCR. Webpage: https://deanplayerljx.github.io/tabvcr/", "full_text": "TAB-VCR: Tags and Attributes based Visual Commonsense Reasoning Baselines

Jingxiang Lin, Unnat Jain, Alexander G. Schwing
University of Illinois at Urbana-Champaign
https://deanplayerljx.github.io/tabvcr

Abstract

Reasoning is an important ability that we learn from a very early age. Yet, reasoning is extremely hard for algorithms. Despite impressive recent progress on tasks that necessitate reasoning, such as visual question answering and visual dialog, models often exploit biases in datasets.
To develop models with better reasoning abilities, the new visual commonsense reasoning (VCR) task has recently been introduced. Models not only have to answer questions, they also have to provide a reason for the given answer. The proposed baseline achieved compelling results, leveraging a meticulously designed model composed of LSTM modules and attention nets. Here we show that a much simpler model, obtained by ablating and pruning the existing intricate baseline, can perform better with half the number of trainable parameters. By associating visual features with attribute information and better text-to-image grounding, we obtain further improvements for our simpler yet effective baseline, TAB-VCR. We show that this approach results in a 5.3%, 4.4% and 6.5% absolute improvement over the previous state of the art [103] on question answering, answer justification and holistic VCR.

1 Introduction

Reasoning abilities are important for many tasks such as answering (referential) questions, discussing concerns and participating in debates. While we are trained to ask and answer "why" questions from an early age, and while we generally master answering questions about observations with ease, visual reasoning abilities are anything but simple for algorithms.
Nevertheless, respectable accuracies have recently been achieved for many tasks where visual reasoning abilities are necessary. For instance, for visual question answering [9, 32] and visual dialog [20], compelling results have been reported in recent years, and many present-day models achieve accuracies well beyond random guessing on challenging datasets such as [30, 47, 109, 37]. However, it is also known that algorithm results are not stable at all and trained models often leverage biases to answer questions.
For example, both questions about the existence and non-existence of a "pink elephant" are likely answered affirmatively, while questions about counting are most likely answered with the number 2. Even more importantly, a random answer is returned if the model is asked to explain the reason for the provided answer.
To address this concern, a new challenge on "visual commonsense reasoning" [103] was introduced recently, combining reasoning about physics [69, 99], social interactions [2, 89, 16, 33], understanding of procedures [107, 3] and forecasting of actions in videos [84, 26, 108, 90, 28, 74, 100]. In addition to answering a question about a given image, the algorithm is tasked to provide a rationale that justifies the given answer. In this new dataset, the questions, answers and rationales are expressed in natural language containing references to the objects. The proposed model, which achieves compelling results, leverages those cues by combining a long short-term memory (LSTM) module based deep net with attention over objects to obtain grounding and context.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Motivation and improvements. (a) Associating attributes to VCR tags: the VCR object detections, i.e., red boxes and labels in blue, are shown. We capture visual attributes by replacing the image classification CNN (used in previous models) with an image+attribute classification CNN. The predictions of this CNN are highlighted in orange. (b) Finding tags missed by VCR: many nouns referred to in the VCR text aren't tagged, i.e., grounded to objects in the image. We utilize the same image CNN as in (a) to detect objects and ground them to text. The new tags we found augment the VCR tags, and are highlighted with yellow bounding boxes and the associated labels in green.

However, the proposed model is also very intricate.
In this paper we revisit this baseline and show that a much simpler model with less than half the trainable parameters achieves significantly better results. As illustrated in Fig. 1, and different from existing models, we also show that attribute information about objects and careful detection of objects can greatly improve model performance. To this end we extract visual features using an image CNN trained for the auxiliary task of attribute prediction. In addition to encoding the image, we utilize the CNN to augment the object-word groundings provided in the VCR dataset. An effective grounding for these new tags is obtained by using a combination of part-of-speech tagging and Wu-Palmer similarity. We refer to our developed tagging and attribute baseline as TAB-VCR.
We evaluate the proposed approach on the challenging and recently introduced visual commonsense reasoning (VCR) dataset [103]. We show that a simple baseline which carefully leverages attribute information and object detections is able to outperform the existing state of the art by a large margin despite having less than half the trainable model parameters.

2 Related work

In the following we briefly discuss work related to vision-based question answering, explainability and visual attributes.
Visual Question Answering. Image-based question answering has continuously evolved in recent years, particularly due to the release of various datasets [65, 73, 9, 101, 30, 104, 109, 47, 44, 72, 71]. Specifically, Zhang et al. [104] and Goyal et al. [32] focus on balancing the language priors of Antol et al. [9] for abstract and real images. Agrawal et al. [1] remove the IID assumption to create different distributions of answers for train and test splits, which further discourages transfer of language priors. Hudson and Manning [37] balance open questions in addition to binary questions (as in Goyal et al. [32]).
Image-based dialog [20, 24, 21, 42, 60] can also be posed as a step-by-step image-based question answering and question generation [68, 43, 55] problem. Similarly related are question answering datasets built on videos [86, 64, 51, 52] and those based on visual embodied agents [31, 22].
Various models have been proposed for these tasks, particularly for VQA [9, 32]: selecting sub-regions of an image [87], single attention [13, 98, 6, 19, 29, 82, 97, 39, 102], multimodal attention [59, 79, 70], memory nets and knowledge bases [96, 94, 91, 62], improvements in neural architecture [66, 63, 7, 8] and bilinear pooling representations [29, 46, 12].
Explainability. The effect of explanations on learning has been well studied in cognitive science and psychology [57, 92, 93]. Explanations play a critical role in child development [50, 18] and more generally in educational environments [15, 76, 77]. Explanation-based models for applications in medicine and tutoring have been proposed previously [83, 88, 49, 17]. Inspired by these findings, language and vision research on attention mechanisms helps to provide insights into decisions made by deep net models [59, 80]. Moreover, explainability in deep models has been investigated by modifying CNNs to focus on object parts [106, 105], decomposing questions using neural modular substructures [8, 7, 23], and interpretable hidden units in deep models [10, 11]. Most relevant to our research are works on natural language explanations.
This includes multimodal explanation [38] and textual explanations for classifier decisions [35] and self-driving vehicles [45].

[Figure 2 diagram: object detections, a query and a response are embedded via an image CNN and BERT, passed through a downsample net, an LSTM and pooling, and scored by a logit MLP.]

Figure 2: (a) Overview of the proposed TAB-VCR model: Inputs are the image (with object bounding boxes), a query and a candidate response. Sentences (query & response) are represented using BERT embeddings and encoded jointly with the image using a deep net module f(·; θ). The representations of query and response are concatenated and scored via a multi-layer perceptron (MLP); (b) Details of the joint image & language encoder f(·; θ): BERT embeddings of each word are concatenated with their corresponding local image representation. This information is passed through an LSTM and pooled to give the output f((I, w); θ). The network components outlined in black, i.e., MLP, downsample net and LSTM, are the only components with trainable parameters.

Visual Commonsense Reasoning. The recently introduced visual commonsense reasoning dataset [103] combines the above two research areas, studying explainability (reasoning) through two multiple-choice subtasks. First, the question answering subtask requires predicting the answer to a challenging question given an image. Second, and more connected to explainability, the answer justification subtask requires predicting the rationale given a question and a correct answer. To solve the VCR task, Zellers et al. [103] base their model on a convolutional neural network (CNN) trained for classification.
Instead, we associate VCR detections with visual attribute information to obtain significant improvements with no architectural change or additional parameter cost. We discuss related work on visual attributes in the following.
Visual attributes. Attributes are semantic properties that describe a localized object. Visual attributes are helpful to describe an unfamiliar object category [27, 48, 78]. Visual Genome [47] provides over 100k images along with their scene graphs and attributes. Anderson et al. [5] capture attributes in visual features by using an auxiliary attribute prediction task on a ResNet101 [34] backbone.

3 Attribute-based Visual Commonsense Reasoning

We are interested in visual commonsense reasoning (VCR). Specifically, we study simple yet effective models and incorporate important information missed by previous methods: attributes and additional object-text groundings. Given an input image, the VCR task is divided into two subtasks: (1) question answering (Q→A): given a question (Q), select the correct answer (A) from four candidate answers; (2) answer justification (QA→R): given a question (Q) and its correct answer (A), select the correct rationale (R) from four candidate rationales. Importantly, both subtasks can be unified: choosing a response from four options given a query. For Q→A, the query is a question and the options are candidate answers. For QA→R, the query is a question appended by its correct answer and the options are candidate rationales. Note, the Q→AR task combines both, i.e., a model needs to succeed at both Q→A and QA→R. The proposed method focuses on choosing a response given a query, for which we introduce notation next.
We are given an image, a query, and four candidate responses. The words in the query and responses are grounded to objects in the image. The query and response are collections of words, while the image data is a collection of object detections.
One of the detections also corresponds to the entire image, symbolizing a global representation. The image data is denoted by the set o = (o_i)_{i=1}^{n_o}, where each o_i, i ∈ {1, ..., n_o}, consists of a bounding box b_i and a class label l_i ∈ L.¹ The query is composed of a sequence q = (q_i)_{i=1}^{n_q}, where each q_i, i ∈ {1, ..., n_q}, is either a word in the vocabulary V or a tag referring to a bounding box in o. A data point consists of four responses, and we denote a response by the sequence r = (r_i)_{i=1}^{n_r}, where r_i, i ∈ {1, ..., n_r}, (like the query) can either refer to a word in the vocabulary V or a tag.

¹The dataset also includes information about segmentation masks, which are neither used here nor by previous methods. Data available at: visualcommonsense.com

[Figure 3 shows two qualitative examples, each with the question answering and answer justification subtasks and their four candidate responses: (a) "(Q) How did [0, 1] get here?" with correct answer "They traveled in a cart." and correct rationale "The cart beside them is likely their mode of transportation."; (b) "(Q) Will [0] go to work alone?" with correct answer "No, [1] will go with him." and correct rationale "Both [0, 1] are wearing lab coats and are standing in close proximity to one another indicating they probably work together."]

Figure 3: Qualitative results: Two types of new tags found by our method are (a) direct matches, e.g., the word cart in the text matched to the same label in the image, and (b) word sense based matches, e.g., the word coats matched to the label 'jacket' with the same meaning. Note that the images on the left show the object detections provided by VCR. The images in the middle show the attributes predicted by our model and thereby captured in visual features. The images on the right show new tags detected by our proposed method. Below the images are the question answering and answer justification subtasks.

We develop a conceptually simple joint encoder for language and image information, f(·; θ), where θ is the catch-all for all the trainable parameters.
In the remainder of this section, we first present an overview of our approach. Subsequently, we discuss details of the joint encoder f(·; θ). Afterward, we introduce how to incorporate attribute information and find new tags, which helps improve the performance of our simple baseline.
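Since the holistic Q→AR task counts a prediction as correct only when both the Q→A choice and the QA→R choice are correct, its accuracy follows directly from per-example subtask predictions. A minimal sketch (the function name and toy data are ours, not from the paper):

```python
# Illustrative sketch: Q->AR accuracy from the two subtask predictions.
# Each argument is a list of chosen option indices (0..3), one per example.
def qa_ar_accuracy(answer_preds, answer_gold, rationale_preds, rationale_gold):
    assert len(answer_preds) == len(rationale_preds)
    correct = sum(
        a == ag and r == rg
        for a, ag, r, rg in zip(answer_preds, answer_gold,
                                rationale_preds, rationale_gold)
    )
    return correct / len(answer_preds)

# Toy example: 2 of 3 answers and 2 of 3 rationales are correct,
# but only one example gets both right, so Q->AR accuracy is 1/3.
acc = qa_ar_accuracy([0, 1, 2], [0, 1, 3], [3, 0, 1], [3, 2, 1])
```

This also makes explicit why Q→AR accuracy is never above the accuracy of either subtask alone.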
We defer details about training and implementation to the supplementary material.

3.1 Overview

As mentioned, visual commonsense reasoning requires choosing a response from four candidates. Here, we score each candidate separately. The separate scoring of responses is necessary to build a more widely applicable framework, which is independent of the number of responses to be scored.
Our proposed approach is outlined in Fig. 2(a). The three major components of our approach are: (1) BERT [25] embeddings for words; (2) a joint encoder f(·; θ) to obtain (o, q) and (o, r) representations; and (3) a multi-layer perceptron (MLP) to score these representations. Each word in the query set q and response set r is embedded via BERT. The BERT embeddings of q and associated image data from o are jointly encoded to obtain the representation f((o, q); θ). An analogous representation for responses is obtained via f((o, r); θ). Note that the joint encoder is identical for both the query and the response. The two representations are concatenated and scored via an MLP. These scores or logits are further normalized using a softmax. The network is trained end-to-end using a cross-entropy loss of predicted probabilities vis-à-vis correct responses.

Algorithm 1 Finding new tags
1:  Forward pass through image CNN to obtain object detections ô
2:  L̂ ← set(all class labels in ô)
3:  for w ∈ w where w ∈ {q, r} do
4:      if w is tag then w ← remap(w)
5:  new_tags ← {}
6:  for w ∈ w where w ∈ {q, r} do
7:      if (pos_tag(w | w) ∈ {NN, NNS}) and (wsd_synset(w, w) has a noun) then
8:          if w ∈ L̂ then                        ▷ Direct match between word and detections
9:              new_detections ← detections in ô corresponding to w
10:             add (w, new_detections) to new_tags
11:         else                                  ▷ Use word sense to match word and detections
12:             max_wup ← 0
13:             word_lemma ← lemma(w)
14:             word_sense ← first_synset(word_lemma)
15:             for l̂ ∈ L̂ do
16:                 if wup_similarity(first_synset(l̂), word_sense) > max_wup then
17:                     max_wup ← wup_similarity(first_synset(l̂), word_sense)
18:                     best_label ← l̂
19:             if max_wup > k then
20:                 new_detections ← detections in ô corresponding to best_label
21:                 add (w, new_detections) to new_tags

Next, we provide details of the joint encoder before we describe our approach to incorporate attributes and better image-text grounding, to improve the performance.

3.2 Joint image & language encoder

The joint language and image encoder is illustrated in Fig. 2(b). The inputs to the joint encoder are word embeddings of a sentence (either q or r) and associated object detections from o. The local image region defined by these bounding boxes is encoded via an image CNN to a 2048-dimensional vector. This vector is projected to a 512-dimensional embedding using a fully connected downsample net. The language and image embeddings are concatenated and transformed using a long short-term memory network (LSTM) [36]. Note that for non-tag words, i.e., words without an associated object detection, the object detection corresponding to the entire image is utilized. The outputs of each unit of the LSTM are pooled together to obtain the final joint encoding of q (or r) and o.
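To make the data flow concrete, here is a minimal numpy sketch of this encoder under the shapes stated above. Everything beyond those shapes is an assumption for illustration: the weights are random stand-ins, the 768-dimensional word embedding matches BERT-base, and mean pooling is one plausible choice for the pooling step; all variable names are ours.

```python
# Sketch of the joint encoder data flow: per-word concatenation of a word
# embedding with a downsampled (2048 -> 512) region feature, a single-layer
# LSTM, and mean pooling of the LSTM outputs. Random weights; illustration only.
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_DOWN, D_WORD, D_HID = 2048, 512, 768, 512

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_pool(xs, Wx, Wh, b):
    """Run an LSTM over the sequence xs and mean-pool the hidden states."""
    h = np.zeros(D_HID)
    c = np.zeros(D_HID)
    outs = []
    for x in xs:
        z = Wx @ x + Wh @ h + b          # all four gate pre-activations at once
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)
        h = o * np.tanh(c)
        outs.append(h)
    return np.mean(outs, axis=0)         # pooled joint encoding

# Per-word inputs: word embedding + downsampled feature of the grounded
# detection (or of the whole image for non-tag words).
W_down = rng.normal(size=(D_DOWN, D_IMG)) * 0.01   # stand-in downsample net
n_words = 5
words = rng.normal(size=(n_words, D_WORD))         # stand-in BERT embeddings
regions = rng.normal(size=(n_words, D_IMG))        # one region feature per word
xs = [np.concatenate([words[t], W_down @ regions[t]]) for t in range(n_words)]

D_IN = D_WORD + D_DOWN
Wx = rng.normal(size=(4 * D_HID, D_IN)) * 0.01
Wh = rng.normal(size=(4 * D_HID, D_HID)) * 0.01
b = np.zeros(4 * D_HID)
encoding = lstm_pool(xs, Wx, Wh, b)                # a 512-d joint encoding
```

In the full model, two such encodings (one for the query, one for a response) are concatenated and scored by the MLP described in Sec. 3.1.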
Note that the network components with a black outline, i.e., the downsample net and LSTM, are the only components with trainable parameters. We design this so that no gradients need to be propagated back to the image CNN or to the BERT model, since both of them are parameter intensive, requiring significant training time and data. This choice facilitates the pre-computation of language and image features for faster training and inference.

3.3 Improving visual representation & image-text grounding

Attributes capturing visual features. Almost all previous VCR baselines have used a CNN trained for ImageNet classification to extract visual features. Note that the class label l_i for each bounding box is already available in the dataset and incorporated in the models (previous and ours) via BERT embeddings. We hypothesize that visual question answering and reasoning benefit from information about object characteristics and attributes. This intuition is illustrated in Fig. 3, where attributes add valuable information to help reason about the scene, such as 'black picture,' 'gray tie,' and 'standing man.' To validate this hypothesis we deploy a pretrained attribute classifier which augments every detected bounding box b_i with a set of attributes such as colors, texture, size, and emotions. We show the attributes predicted by our model's image CNN in Fig. 1(a). For this, we take advantage of work by Anderson et al. [5] as it incorporates attribute features to improve performance on language and vision tasks. Note that Zellers et al. [103] evaluate the model proposed by Anderson et al. [5] with BERT embeddings to obtain 39.6% accuracy on the test set of the Q→AR task. As detailed in Sec. 4.3, with the same CNN and BERT embeddings, our network achieves 50.5%. We achieve this by capturing recurrent information of LSTM modules via pooling and better scoring through an MLP.
This is in contrast to Zellers et al. [103], where the VQA 1000-way classification is removed and the response representation is scored using a dot product.
New tags for better text-to-image grounding. Associating a word in the text with an object detection in the image, i.e., o_i = (b_i, l_i), is what we commonly refer to as text-image grounding. Any word serving as a pointer to a detection is referred to as a tag by Zellers et al. [103]. Importantly, many nouns in the text (query or responses) aren't grounded with their appearance in the image. We explain possible reasons in Sec. 4.4. To overcome this shortcoming, we develop Algorithm 1 to find new text-image groundings or new tags. A qualitative example is illustrated in Fig. 3. Nouns such as 'cart' and 'coats' weren't tagged by VCR, while our TAB-VCR model can tag them.
Specifically, for text-image grounding we first find detections ô (in addition to the VCR-provided o) using the image CNN. The set of unique class labels in ô is assigned to L̂. Both q and r are modified such that all tags (pointers to detections in the image) are remapped to natural language (the class label of the detection). This is done via the remap function. We follow Zellers et al. [103] and associate a gender-neutral name with the 'person' class. For instance, "How did [0,1] get here?" in Fig. 3 is remapped to "How did Adrian and Casey get here?". This remapping is necessary for the next step of part-of-speech (POS) tagging, which operates only on natural language.
Next, the POS tagging function (pos_tag) parses a sentence w and assigns POS tags to each word w. For finding new tags, we are only interested in words whose POS tag is either singular noun (NN) or plural noun (NNS). For these noun words, we check if a word w directly matches a label in L̂.
If such a direct match exists, we associate w with the detections of the matching label. As shown in Fig. 3(a), this direct matching associates the word cart in the text (response 1 of the Q→A subtask and response 4 of the QA→R subtask) with the detection corresponding to label 'cart' in the image, creating a new tag.
If there is no such direct match for w, we find matches based on word sense. This is motivated in Fig. 3(b), where the word 'coat' has no direct match to any image label in L̂. Rather, there is a detection of 'jacket' in the image. Notably, the word 'coat' has multiple word senses, such as 'an outer garment that has sleeves and covers the body from shoulder down' and 'growth of hair or wool or fur covering the body of an animal.' Also, 'jacket' has multiple word senses, two of which are 'a short coat' and 'the outer skin of a potato.' As can be seen, the first word senses of 'coat' and 'jacket' are similar and would help match 'coat' to 'jacket.' Having said that, the second word senses differ from common use and from each other. Hence, for words that do not directly match a label in L̂, choosing the appropriate word sense is necessary. To this end, we adopt a simple approach, where we use the most frequently used word sense of w and of labels in L̂. This is obtained using the first synset in WordNet via NLTK [67, 58]. Then, using the first synset of w and labels in L̂, we find the best matching label best_label corresponding to the highest Wu-Palmer similarity between synsets [95]. Additionally, we lemmatize w before obtaining its first synset.
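The paper's implementation relies on NLTK's WordNet for this step. As a self-contained illustration of the Wu-Palmer measure it uses, here is the similarity wup(a, b) = 2·depth(lcs(a, b)) / (depth(a) + depth(b)) computed over a tiny hand-made hypernym taxonomy; the taxonomy and all names are ours, not WordNet's.

```python
# Wu-Palmer similarity on a toy hypernym taxonomy rooted at 'entity'.
# In the real pipeline the synsets and depths come from WordNet via NLTK.
PARENT = {
    "physical_entity": "entity",
    "artifact": "physical_entity",
    "garment": "artifact",
    "coat": "garment",
    "jacket": "garment",
    "vehicle": "artifact",
    "cart": "vehicle",
}

def path_to_root(node):
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path  # node, ..., 'entity'

def depth(node):
    return len(path_to_root(node))  # the root has depth 1

def lcs(a, b):
    """Deepest common ancestor (least common subsumer) of a and b."""
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):  # first shared node is the deepest one
        if node in ancestors_a:
            return node
    raise ValueError("no common ancestor")

def wup(a, b):
    return 2.0 * depth(lcs(a, b)) / (depth(a) + depth(b))

coat_jacket = wup("coat", "jacket")  # lcs is 'garment' -> 2*4 / (5+5) = 0.8
coat_cart = wup("coat", "cart")      # lcs is 'artifact' -> 2*3 / (5+5) = 0.6
```

Note that in this shallow toy tree even near-synonyms score only 0.8; in WordNet, where synsets sit much deeper, closely related senses score much closer to 1, which is why a high threshold can separate genuine sense matches from loose ones.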
If the Wu-Palmer similarity between word w and best_label is greater than a threshold k, we associate the word with the detections of best_label. Overall, this procedure leads to new tags where text and label aren't the same but have the same meaning. We found k = 0.95 to be apt for our experiments. While inspecting results, we found this algorithm fails to match the word 'men' in the text to the detection label 'man.' This is due to the lemmatize function provided by NLTK [58]. Consequently, we additionally allow new tags corresponding to this 'men-man' match.
This algorithm permits finding new tags in 7.1% of answers and 32.26% of rationales. A split over correct and incorrect responses is illustrated in Fig. 4. These new tag detections are used by our new tag variant TAB-VCR. If there is more than one detection associated with a new tag, we average the visual features at the step before the LSTM in the joint encoder.
Implementation details. We defer specific details about training, implementation and design choices to the supplementary material. The code can be found at https://github.com/deanplayerljx/tab-vcr.

4 Experiments

In this section, we first introduce the VCR dataset and describe metrics for evaluation. Afterward, we quantitatively compare our approach and improvements to the current state-of-the-art method [103] and to top VQA models. We include a qualitative evaluation of TAB-VCR and an error analysis.

Table 1: Comparison of our approach to the current state-of-the-art R2C [103] on the validation set.

                                                      Q→A    QA→R   Q→AR   Params (Mn)
Model                                                 (val)  (val)  (val)  (total)  (trainable)
R2C (Zellers et al. [103])                            63.8   67.2   43.1   35.3     26.8
Improving R2C:
R2C + Det-BN                                          64.49  67.02  43.61  35.3     26.8
R2C + Det-BN + Freeze (R2C++)                         65.30  67.55  44.41  35.3     11.7
R2C++ + Resnet101                                     67.55  68.35  46.42  54.2     11.7
R2C++ + Resnet101 + Attributes                        68.53  70.86  48.64  54.0     11.5
Ours:
Base                                                  66.39  69.02  46.19  28.4     4.9
Base + Resnet101                                      67.50  69.75  47.51  47.4     4.9
Base + Resnet101 + Attributes                         69.51  71.57  50.08  47.2     4.7
Base + Resnet101 + Attributes + New Tags (TAB-VCR)    69.89  72.15  50.62  47.2     4.7

Legend: Det-BN: deterministic testing using train-time batch normalization statistics. Freeze: freeze all parameters of the image CNN. ResNet101: ResNet101 backbone as image CNN (default is ResNet50). Attributes: attribute-capturing visual features by using [5] (which has a ResNet101 backbone) as image CNN. Base: our base model, as detailed in Fig. 2(b) and Sec. 3.1. New Tags: augmenting the object detection set with new tags (as detailed in Sec. 3.3), i.e., grounding additional nouns in the text to the image.

Table 2: Evaluation on test set: Accuracy on the three VCR tasks. Comparison with top VQA models + BERT performance (source: [103]). Our best model outperforms R2C [103] on the test set by a significant margin.

Model             Q→A   QA→R  Q→AR
Revisited [41]    57.5  63.5  36.8
BottomUp [5]      62.3  63.0  39.6
MLB [46]          61.8  65.4  40.6
MUTAN [12]        61.0  64.4  39.3
R2C [103]         65.1  67.3  44.0
TAB-VCR (ours)    70.4  71.7  50.5

Figure 4: New tags: Percentage of response sentences with a new tag, i.e., a new grounding for a noun and object detection. Correct responses are more likely to have new detections than incorrect ones.
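As a quick arithmetic cross-check, the absolute improvements quoted in the abstract follow directly from the Table 2 test-set accuracies (variable names are ours):

```python
# Sanity check: the 5.3 / 4.4 / 6.5 point improvements quoted in the paper
# are the Table 2 differences between TAB-VCR and R2C on the test set.
R2C     = {"Q->A": 65.1, "QA->R": 67.3, "Q->AR": 44.0}
TAB_VCR = {"Q->A": 70.4, "QA->R": 71.7, "Q->AR": 50.5}

gains = {task: round(TAB_VCR[task] - R2C[task], 1) for task in R2C}
```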
Correct responses more\nlikely have new detections than incorrect ones.\n\n4.1 Dataset\nWe train our models on the visual commonsense reasoning dataset [103] which contains over 212k\n(train set), 26k (val set) and 25k (test set) questions on over 110k unique movie scenes. The scenes\nwere selected from LSMDC [75] and MovieClips, after they passed an \u2018interesting \ufb01lter.\u2019 For\neach scene, workers were instructed to created \u2018cognitive-level\u2019 questions. Workers answered these\nquestions and gave a reasoning or rationale for the answer.\n4.2 Metrics\nModels are evaluated with classi\ufb01cation accuracy on the Q!A, QA!R subtasks and the holistic\nQ!AR task. For train and validation splits, the correct labels are available for development. To\nprevent over\ufb01tting, the test set labels were not released. Since evaluation on the test set is a manual\neffort by Zellers et al. [103], we provide numbers for our best performing model on the test set and\nillustrate results for the ablation study on the validation set.\n4.3 Quantitative evaluation\nTab. 1 compares the performance of variants of our approach to the current state-of-the-art R2C [103].\nWhile we report validation accuracy on both subtasks (Q!A and QA!R) and the joint (Q!AR)\ntask in Tab. 1, in the following discussion we refer to percentages with reference to Q!AR.\nWe make two modi\ufb01cations to improve R2C. The \ufb01rst is Det-BN where we calculate and use train\ntime batch normalization [40] statistics. Second, we freeze all the weights of the image CNN\nin R2C, whereas Zellers et al. [103] keep the last block trainable. We provide a detailed study on\nfreeze later. With these two minor changes, we obtain an improvement (1.31%) in performance\nand a signi\ufb01cant reduction in trainable parameters (15Mn). We use the shorthand R2C++ to refer to\nthis improved variant of R2C.\nOur base model (described in Sec. 
3), which includes the Det-BN and Freeze improvements, improves over R2C++ by 1.78%, while being conceptually simple and having half the number of trainable parameters. By using a more expressive ResNet as image CNN (Base + Resnet101), we obtain another 1.32% improvement. We obtain another big increase of 2.57% by leveraging attribute-capturing visual features (Base + Resnet101 + Attributes).

[Figure 5 shows three example questions with their candidate responses: "(Q) Is everyone at school?", "(Q) Why is [0] also focused on [1] hands?" and "(Q) Do you think [4] will sit down on [9]?"]

Figure 5: Qualitative analysis of error modes: Responses with similar meaning (left), lack of context (middle) or ambiguity in future actions (right). Correct answers are marked with ticks and our model's incorrect prediction is outlined in red.

Table 3: Effect of shared vs. unshared parameters in the joint encoder f(·; θ) of the TAB-VCR model.

Encoder    Q→A    QA→R   Q→AR   Params
Shared     69.89  72.15  50.62  4.7M
Unshared   69.59  72.25  50.35  7.9M

Table 4: Error analysis as a function of the number of tags. Less image-text grounding increases TAB-VCR errors.

              Avg. no. of tags in query+response
VCR subtask   (a) all   (b) correct   (c) errors
Q→A           2.673     2.719         2.566
QA→R          4.293     4.401         4.013
Our best performing variant incorporates new tags during training and inference (TAB-VCR), with a final 50.62% on the validation set. We also ablate R2C++ with the ResNet101 and Attributes modifications, which leads to better performance too. This suggests our improvements aren't confined to our particular net. Additionally, we share the encoder for query and responses. We empirically studied the effect of sharing encoder parameters and found no significant difference (Tab. 3) when using separate weights, which comes at the cost of 3.2M extra trainable parameters. Note that Zellers et al. [103] also share the encoder for query and response processing. Hence, our design choice makes the comparison fair.
In Tab. 2 we show results evaluating the performance of TAB-VCR on the private test set, set aside by Zellers et al. [103]. We obtain a 5.3%, 4.4% and 6.5% absolute improvement over R2C on the test set. We perform much better than top VQA models which were adapted for VCR in [103]. Models evaluated on the test set are posted on the leaderboard². We appear as 'TAB-VCR' and outperform prior peer-reviewed work. At the time of writing (23rd May 2019), TAB-VCR ranked second in the single model category. After submission of this work, other reports addressing VCR have been released. At the time of submitting this camera-ready (27th Oct 2019), TAB-VCR ranked seventh among single models on the leaderboard. Based on the available reports [54, 85, 4, 53, 61, 14], most of these seven methods capture the idea of re-training BERT with extra information from Conceptual Captions [81]. This, in essence, is orthogonal to our new tags and attributes approach to build simple and effective baselines with significantly fewer parameters.
Fig. 4 illustrates the effectiveness of our new tag detection: 10.4% of correct answers had at least one new tag detected. With 38.93%, the number is even higher for correct rationales. 
This is intuitive, as humans refer to more objects while reasoning about an answer than in the answer itself.
Finetuning vs. freezing the last conv block. In Tab. 5 we study the effect of finetuning the last conv block of ResNet101 and the downsample net. Zellers et al. [103] use the configuration of row #1. We assess lower learning rates – 0.5x, 0.25x and 0.125x (#2 to #4). We chose to freeze the conv block (#5), which reduces trainable parameters by 15M with a slight improvement in performance. By comparing #5 and #6, we find the presence of the downsample net to reduce the model size and improve performance. After conducting this ablation study for the base model's architecture design, we updated the python dependency packages. This update led to a slight difference between the accuracy of #5 in Tab. 5 (before the update) and the final accuracy reported in Tab. 1 (after the update). However, the versions of python dependencies are consistent across all variants listed in Tab. 5.

²visualcommonsense.com/leaderboard

Table 5: Ablation for the base model: finetuning vs. freezing the weights of the fourth conv block in the ResNet101 image CNN, with and without the downsample net (which projects the image representation from 2048 to 512 dimensions).

#   4th conv block          Downsample net   Trainable params (M)   Q→A    QA→R   Q→AR
1   finetuned (1x lr)       yes              19.9                   68.86  64.57  44.60
2   finetuned (1/2 lr)      yes              19.9                   68.14  64.26  44.08
3   finetuned (1/4 lr)      yes              19.9                   67.73  63.11  42.87
4   finetuned (1/8 lr)      yes              19.9                   67.49  63.51  43.21
5   frozen                  yes              4.9                    69.22  66.47  46.45
6   frozen                  no               7.0                    69.09  65.30  45.57

Table 6: Accuracy by question type (with at least 100 counts) of the TAB-VCR model. Why & how questions are most challenging for the Q→A subtask.

Ques. type   Matching patterns           Counts   Q→A     QA→R
what         what                        10688    72.30   72.74
why          why                         9395     65.14   73.02
isn't        is, are, was, were, isn't   1768     75.17   67.70
where        where                       1546     73.54   73.09
how          how                         1350     60.67   69.19
do           do, did, does               655      72.82   65.80
who          who, whom, whose            556      86.69   69.78
will         will, would, wouldn't       307      74.92   73.29

4.4 Qualitative evaluation and error analysis
We illustrate qualitative results in Fig. 3. For easy visualization, we separate the image input to our model into three parts: the left, middle and right images show the VCR detections & labels, the attribute predictions of our image CNN, and the new tags, respectively. Note how our model can ground important words. For instance, for the example shown in Fig. 3(a), the correct answer and rationale prediction are based on the cart in the image, which we ground. The word 'cart' wasn't grounded in the original VCR dataset. Similarly, grounding the word 'coats' helps to answer and reason about the example in Fig. 3(b).
Explanation for missed tags. As discussed in Sec. 3.3, the VCR dataset contains various nouns that aren't tagged, such as 'eye,' 'coats' and 'cart,' as highlighted in Fig. 1 and Fig. 3. This can be attributed to the methodology adopted for collecting the VCR dataset. Zellers et al. [103] instructed workers to provide questions, answers and rationales by using natural language and object detections o (COCO [56] objects). We found that workers used natural language even if the corresponding object detection was available. Additionally, for some data points, we found objects mentioned in the text without a valid object detection in o.
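The search for such untagged nouns can be sketched as follows. This is a simplified, hypothetical rendering of the new-tag idea of Sec. 3.3: the hardcoded noun vocabulary stands in for the POS-tagging and WordNet-based matching actually used in the pipeline.

```python
# Hypothetical sketch: nouns mentioned in the text that match a known object
# class but have no grounding in the worker-provided detections o become
# candidate new tags. A real pipeline would use POS tagging and WordNet
# matching instead of this toy noun list.
NOUN_VOCAB = {"cart", "coats", "gloves", "eye", "phone"}

def candidate_new_tags(text, grounded_labels):
    # Normalize tokens and keep known nouns that were never grounded.
    words = (w.strip(".,?!").lower() for w in text.split())
    return sorted({w for w in words if w in NOUN_VOCAB and w not in grounded_labels})
```

For example, `candidate_new_tags("Why is he pushing the cart ?", {"person"})` returns `["cart"]`, mirroring the ungrounded 'cart' of Fig. 3(a).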
This may be because the detector used by Zellers et al. [103] is trained on COCO [56], which has only 80 classes.
Error modes. We also qualitatively study TAB-VCR's shortcomings by analyzing error modes, as illustrated in Fig. 5. The correct answer is marked with a tick while our prediction is outlined in red. Examples include options with overlapping meaning (Fig. 5(a)): both the third and the fourth answers have similar meaning, which can be attributed to the fact that Zellers et al. [103] automatically curated competing incorrect responses via adversarial matching. Our method misses the 'correct' answer. Another error mode (Fig. 5(b)) is due to objects which aren't present in the image, like the "gloves in a show of flirtatious intent." This can be attributed to the fact that crowd workers were shown context from the video (a video caption) in addition to the image, which isn't available in the dataset. Also, as highlighted in Fig. 5(c), scenes often offer an ambiguous future, and our model gets some of these cases incorrect.
Error and grounding. In Tab. 4, we provide the average number of tags in the query+response for both subtasks. We state this value for the following subsets: (a) all datapoints, (b) datapoints where TAB-VCR was correct, and (c) datapoints where TAB-VCR made errors. Based on this, we infer that our model performs better on datapoints with more tags, i.e., with a richer association of image and text.
Error and question types. In Tab. 6 we show the accuracy of the TAB-VCR model based on question type, defined by the corresponding matching patterns. Our model is more error-prone on why and how questions on the Q→A subtask, which usually require more complex reasoning.
5 Conclusion
We develop a simple yet effective baseline for visual commonsense reasoning. 
The proposed approach leverages additional object detections to better ground noun-phrases and assigns attributes to current and newly found object groundings. Without an intricate and meticulously designed attention model, we show that the proposed approach outperforms the state-of-the-art, despite significantly fewer trainable parameters. We think this simple yet effective baseline and the new noun-phrase grounding can provide the basis for further development of visual commonsense models.

Acknowledgements

This work is supported in part by NSF under Grant No. 1718221 and MRI #1725729, UIUC, Samsung, 3M, Cisco Systems Inc. (Gift Award CG 1377144) and Adobe. We thank NVIDIA for providing GPUs used for this work and Cisco for access to the Arcetri cluster. The authors thank Prof. Svetlana Lazebnik for insightful discussions and Rowan Zellers for releasing and helping us navigate the VCR dataset & evaluation.

References

[1] A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi. Don't just assume; look and answer: Overcoming priors for visual question answering. In CVPR, 2018.
[2] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In CVPR, 2016.
[3] J.-B. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev, and S. Lacoste-Julien. Unsupervised learning from narrated instruction videos. In CVPR, 2016.
[4] C. Alberti, J. Ling, M. Collins, and D. Reitter. Fusion of detected objects in text for visual question answering. arXiv preprint arXiv:1908.05054, 2019.
[5] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
[6] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Deep compositional question answering with neural module networks. In CVPR, 2016.
[7] J. Andreas, M. Rohrbach, T. Darrell, and D. 
Klein. Learning to compose neural networks for question\n\nanswering. NAACL, 2016.\n\n[8] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural module networks. In CVPR, 2016.\n\n[9] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question\n\nanswering. In ICCV, 2015.\n\n[10] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba. Network dissection: Quantifying interpretability\n\nof deep visual representations. In CVPR, 2017.\n\n[11] D. Bau, J.-Y. Zhu, H. Strobelt, Z. Bolei, J. B. Tenenbaum, W. T. Freeman, and A. Torralba. Gan dissection:\n\nVisualizing and understanding generative adversarial networks. In ICLR, 2019.\n\n[12] H. Ben-younes, R. Cadene, M. Cord, and N. Thome. Mutan: Multimodal tucker fusion for visual question\n\nanswering. In ICCV, 2017.\n\n[13] K. Chen, J. Wang, L.-C. Chen, H. Gao, W. Xu, and R. Nevatia. Abc-cnn: An attention based convolutional\n\nneural network for visual question answering. arXiv preprint arXiv:1511.05960, 2015.\n\n[14] Y.-C. Chen, L. Li, L. Yu, A. E. Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu. Uniter: Learning universal\n\nimage-text representations. arXiv preprint arXiv:1909.11740, 2019.\n\n[15] M. T. Chi, M. Bassok, M. W. Lewis, P. Reimann, and R. Glaser. Self-explanations: How students study\n\nand use examples in learning to solve problems. Cognitive science, 1989.\n\n[16] C.-Y. Chuang, J. Li, A. Torralba, and S. Fidler. Learning to act properly: Predicting and explaining\n\naffordances from images. In CVPR, 2018.\n\n[17] M. G. Core, H. C. Lane, M. Van Lent, D. Gomboc, S. Solomon, and M. Rosenberg. Building explainable\n\narti\ufb01cial intelligence systems. In AAAI, 2006.\n\n[18] K. Crowley and R. S. Siegler. Explanation and generalization in young children\u2019s strategy learning. Child\n\ndevelopment, 1999.\n\n[19] A. Das, H. Agrawal, C. L. Zitnick, D. Parikh, and D. Batra. 
Human attention in visual question answering: Do humans and deep networks look at the same regions? In EMNLP, 2016.
[20] A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra. Visual Dialog. In CVPR, 2017.
[21] A. Das, S. Kottur, J. M. Moura, S. Lee, and D. Batra. Learning cooperative visual dialog agents with deep reinforcement learning. In ICCV, 2017.
[22] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra. Embodied Question Answering. In CVPR, 2018.
[23] A. Das, G. Gkioxari, S. Lee, D. Parikh, and D. Batra. Neural Modular Control for Embodied Question Answering. In ECCV, 2018.
[24] H. de Vries, F. Strub, A. P. S. Chandar, O. Pietquin, H. Larochelle, and A. C. Courville. Guesswhat?! visual object discovery through multi-modal dialogue. In CVPR, 2017.
[25] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[26] K. Ehsani, H. Bagherinezhad, J. Redmon, R. Mottaghi, and A. Farhadi. Who let the dogs out? modeling dog behavior from visual data. In CVPR, 2018.
[27] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, 2009.
[28] P. Felsen, P. Agrawal, and J. Malik. What will happen next? forecasting player moves in sports videos. In CVPR, 2017.
[29] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, 2016.
[30] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Are you talking to a machine? Dataset and Methods for Multilingual Image Question Answering. In NeurIPS, 2015.
[31] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi. IQA: Visual Question Answering in Interactive Environments. In CVPR, 2018.
[32] Y. Goyal, T. Khot, A. Agrawal, D. 
Summers-Stay, D. Batra, and D. Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. IJCV, 2017.
[33] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi. Social gan: Socially acceptable trajectories with generative adversarial networks. In CVPR, 2018.
[34] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[35] L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele, and T. Darrell. Generating visual explanations. In ECCV, 2016.
[36] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
[37] D. A. Hudson and C. D. Manning. Gqa: a new dataset for compositional question answering over real-world images. In CVPR, 2019.
[38] D. Huk Park, L. Anne Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell, and M. Rohrbach. Multimodal explanations: Justifying decisions and pointing to the evidence. In CVPR, 2018.
[39] I. Ilievski, S. Yan, and J. Feng. A focused dynamic attention model for visual question answering. arXiv preprint arXiv:1604.01485, 2016.
[40] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[41] A. Jabri, A. Joulin, and L. van der Maaten. Revisiting visual question answering baselines. In ECCV, 2016.
[42] U. Jain, S. Lazebnik, and A. G. Schwing. Two can play this Game: Visual Dialog with Discriminative Question Generation and Answering. In CVPR, 2018.
[43] U. Jain*, Z. Zhang*, and A. G. Schwing. Creativity: Generating Diverse Questions using Variational Autoencoders. In CVPR, 2017. * equal contribution.
[44] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017.
[45] J. Kim, A. Rohrbach, T. Darrell, J. 
Canny, and Z. Akata. Textual explanations for self-driving vehicles. In ECCV, 2018.
[46] J.-H. Kim, K.-W. On, W. Lim, J. Kim, J.-W. Ha, and B.-T. Zhang. Hadamard product for low-rank bilinear pooling. In ICLR, 2017.
[47] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
[48] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.
[49] H. C. Lane, M. G. Core, M. van Lent, S. Solomon, and D. Gomboc. Explainable artificial intelligence for training and tutoring. In AIED, 2005.
[50] C. H. Legare and T. Lombrozo. Selective effects of explanation on learning during early childhood. Journal of experimental child psychology, 2014.
[51] J. Lei, L. Yu, M. Bansal, and T. L. Berg. Tvqa: Localized, compositional video question answering. In EMNLP, 2018.
[52] J. Lei, L. Yu, T. L. Berg, and M. Bansal. Tvqa+: Spatio-temporal grounding for video question answering. In Tech Report, arXiv, 2019.
[53] G. Li, N. Duan, Y. Fang, D. Jiang, and M. Zhou. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066, 2019.
[54] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
[55] Y. Li, N. Duan, B. Zhou, X. Chu, W. Ouyang, X. Wang, and M. Zhou. Visual question generation as dual task of visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6116–6124, 2018.
[56] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. 
Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
[57] T. Lombrozo. Explanation and abductive inference. Oxford handbook of thinking and reasoning, 2012.
[58] E. Loper and S. Bird. Nltk: the natural language toolkit. arXiv preprint cs/0205028, 2002.
[59] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In NeurIPS, 2016.
[60] J. Lu, A. Kannan, J. Yang, D. Parikh, and D. Batra. Best of both worlds: Transferring knowledge from discriminative learning to a generative visual dialog model. In NeurIPS, 2017.
[61] J. Lu, D. Batra, D. Parikh, and S. Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019.
[62] C. Ma, C. Shen, A. Dick, Q. Wu, P. Wang, A. van den Hengel, and I. Reid. Visual question answering with memory-augmented networks. In CVPR, 2018.
[63] L. Ma, Z. Lu, and H. Li. Learning to answer questions from image using convolutional neural network. In AAAI, 2016.
[64] T. Maharaj, N. Ballas, A. Rohrbach, A. Courville, and C. Pal. A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering. In CVPR, 2017.
[65] M. Malinowski and M. Fritz. A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input. In NeurIPS, 2014.
[66] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. In ICCV, 2015.
[67] G. A. Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995.
[68] N. Mostafazadeh, I. Misra, J. Devlin, M. Mitchell, X. He, and L. Vanderwende. Generating natural questions about an image. arXiv preprint arXiv:1603.06059, 2016.
[69] R. Mottaghi, M. Rastegari, A. Gupta, and A. Farhadi. 
what happens if... learning to predict the effect of\n\nforces in images. In ECCV, 2016.\n\n[70] H. Nam, J.-W. Ha, and J. Kim. Dual attention networks for multimodal reasoning and matching. In\n\nCVPR, 2017.\n\n[71] M. Narasimhan and A. G. Schwing. Straight to the Facts: Learning Knowledge Base Retrieval for Factual\n\nVisual Question Answering. In Proc. ECCV, 2018.\n\n[72] M. Narasimhan, S. Lazebnik, and A. G. Schwing. Out of the Box: Reasoning with Graph Convolution\n\nNets for Factual Visual Question Answering. In Proc. NIPS, 2018.\n\n[73] M. Ren, R. Kiros, and R. Zemel. Exploring models and data for image question answering. In NeurIPS,\n\n2015.\n\n[74] N. Rhinehart and K. M. Kitani. First-person activity forecasting with online inverse reinforcement\n\nlearning. In ICCV, 2017.\n\n[75] A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville, and B. Schiele.\n\nMovie description. IJCV, 2017.\n\n[76] R. D. Roscoe and M. T. Chi. Tutor learning: The role of explaining and responding to questions.\n\nInstructional Science, 2008.\n\n[77] J. A. Ross and J. B. Cousins. Giving and receiving explanations in cooperative learning groups. Alberta\n\nJournal of Educational Research, 1995.\n\n[78] O. Russakovsky and L. Fei-Fei. Attribute learning in large-scale datasets. In ECCV, 2010.\n\n[79] I. Schwartz, A. G. Schwing, and T. Hazan. High-Order Attention Models for Visual Question Answering.\n\nIn NeurIPS, 2017.\n\n[80] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-cam: Visual\n\nexplanations from deep networks via gradient-based localization. In ICCV, 2017.\n\n[81] P. Sharma, N. Ding, S. Goodman, and R. Soricut. Conceptual captions: A cleaned, hypernymed, image\n\nalt-text dataset for automatic image captioning. In Proceedings of ACL, 2018.\n\n[82] K. J. Shih, S. Singh, and D. Hoiem. Where to look: Focus regions for visual question answering. In\n\nCVPR, 2016.\n\n[83] E. H. Shortliffe and B. G. Buchanan. 
A model of inexact reasoning in medicine. Mathematical biosciences, 1975.
[84] K. K. Singh, K. Fatahalian, and A. A. Efros. Krishnacam: Using a longitudinal, single-person, egocentric dataset for scene understanding tasks. In WACV, 2016.
[85] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai. Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.
[86] M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler. Movieqa: Understanding stories in movies through question-answering. In CVPR, 2016.
[87] T. Tommasi, A. Mallya, B. Plummer, S. Lazebnik, A. C. Berg, and T. L. Berg. Combining multiple cues for visual madlibs question answering. International Journal of Computer Vision, 127(1):38–60, 2019.
[88] M. van Lent, W. Fisher, and M. Mancuso. An explainable artificial intelligence system for small-unit tactical behavior. In AAAI, 2004.
[89] P. Vicol, M. Tapaswi, L. Castrejon, and S. Fidler. Moviegraphs: Towards understanding human-centric situations from videos. In CVPR, 2018.
[90] C. Vondrick, H. Pirsiavash, and A. Torralba. Anticipating visual representations from unlabeled video. In CVPR, 2016.
[91] P. Wang, Q. Wu, C. Shen, A. v. d. Hengel, and A. Dick. Explicit knowledge-based reasoning for visual question answering. In IJCAI, 2017.
[92] J. J. Williams and T. Lombrozo. The role of explanation in discovery and generalization: Evidence from category learning. Cognitive Science, 2010.
[93] J. J. Williams and T. Lombrozo. Explanation and prior knowledge interact to guide learning. Cognitive psychology, 2013.
[94] Q. Wu, P. Wang, C. Shen, A. Dick, and A. van den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In CVPR, 2016.
[95] Z. Wu and M. Palmer. Verbs semantics and lexical selection. 
In Proceedings of the 32nd annual meeting\n\non Association for Computational Linguistics, 1994.\n\n[96] C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering.\n\nIn ICML, 2016.\n\n[97] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual\n\nquestion answering. In ECCV, 2016.\n\n[98] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering.\n\nIn CVPR, 2016.\n\n[99] T. Ye, X. Wang, J. Davidson, and A. Gupta. Interpretable intuitive physics model. In ECCV, 2018.\n\n[100] Y. Yoshikawa, J. Lin, and A. Takeuchi. Stair actions: A video dataset of everyday home actions. In arXiv\n\npreprint arXiv:1804.04326, 2018.\n\n[101] L. Yu, E. Park, A. C. Berg, and T. L. Berg. Visual madlibs: Fill in the blank image generation and question\n\nanswering. ICCV, 2015.\n\n[102] Z. Yu, J. Yu, J. Fan, and D. Tao. Multi-modal factorized bilinear pooling with co-attention learning for\n\nvisual question answering. ICCV, 2017.\n\n[103] R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi. From recognition to cognition: Visual commonsense\n\nreasoning. In CVPR, 2019.\n\n[104] P. Zhang, Y. Goyal, D. Summers-Stay, D. Batra, and D. Parikh. Yin and yang: Balancing and answering\n\nbinary visual questions. In CVPR, 2016.\n\n[105] Q. Zhang, Y. Nian Wu, and S.-C. Zhu. Interpretable convolutional neural networks. In CVPR, 2018.\n\n[106] Q. Zhang, Y. Yang, H. Ma, and Y. N. Wu. Interpreting cnns via decision trees. In Proceedings of the\n\nIEEE Conference on Computer Vision and Pattern Recognition, pages 6261\u20136270, 2019.\n\n[107] L. Zhou, C. Xu, and J. J. Corso. Towards automatic learning of procedures from web instructional videos.\n\nIn AAAI, 2018.\n\n[108] Y. Zhou and T. L. Berg. Temporal perception and prediction in ego-centric video. In Proceedings of the\n\nIEEE International Conference on Computer Vision, pages 4498\u20134506, 2015.\n\n[109] Y. Zhu, O. 
Groth, M. Bernstein, and L. Fei-Fei. Visual7W: Grounded Question Answering in Images. In CVPR, 2016.