{"title": "Heterogeneous Graph Learning for Visual Commonsense Reasoning", "book": "Advances in Neural Information Processing Systems", "page_first": 2769, "page_last": 2779, "abstract": "Visual commonsense reasoning task aims at leading the research field into solving cognition-level reasoning with the ability to predict correct answers and meanwhile providing convincing reasoning paths, resulting in three sub-tasks i.e., Q->A, QA->R and Q->AR. It poses great challenges over the proper semantic alignment between vision and linguistic domains and knowledge reasoning to generate persuasive reasoning paths. Existing works either resort to a powerful end-to-end network that cannot produce interpretable reasoning paths or solely explore intra-relationship of visual objects (homogeneous graph) while ignoring the cross-domain semantic alignment among visual concepts and linguistic words. In this paper, we propose a new Heterogeneous Graph Learning (HGL) framework for seamlessly integrating the intra-graph and inter-graph reasoning in order to bridge the vision and language domain. Our HGL consists of a primal vision-to-answer heterogeneous graph (VAHG) module and a dual question-to-answer heterogeneous graph (QAHG) module to interactively refine reasoning paths for semantic agreement. Moreover, our HGL integrates a contextual voting module to exploit a long-range visual context for better global reasoning. 
Experiments on the large-scale Visual Commonsense Reasoning benchmark demonstrate the superior performance of our proposed modules on the three tasks (improving accuracy by 5% on Q->A, 3.5% on QA->R and 5.8% on Q->AR).", "full_text": "Heterogeneous Graph Learning for Visual Commonsense Reasoning

Weijiang Yu1, Jingwen Zhou1, Weihao Yu1, Xiaodan Liang2,*, Nong Xiao1
1School of Data and Computer Science, Sun Yat-sen University
2School of Intelligent Systems Engineering, Sun Yat-sen University
weijiangyu8@gmail.com, zhoujw57@mail2.sysu.edu.cn, weihaoyu6@gmail.com, xdliang328@gmail.com, xiaon6@sysu.edu.cn

Abstract

The visual commonsense reasoning task aims at leading the research field into solving cognition-level reasoning with the ability to predict correct answers and meanwhile provide convincing reasoning paths, resulting in three sub-tasks, i.e., Q→A, QA→R and Q→AR. It poses great challenges for proper semantic alignment between the vision and linguistic domains and for knowledge reasoning to generate persuasive reasoning paths. Existing works either resort to a powerful end-to-end network that cannot produce interpretable reasoning paths or solely explore the intra-relationship of visual objects (homogeneous graph) while ignoring the cross-domain semantic alignment between visual concepts and linguistic words. In this paper, we propose a new Heterogeneous Graph Learning (HGL) framework for seamlessly integrating intra-graph and inter-graph reasoning in order to bridge the vision and language domains. Our HGL consists of a primal vision-to-answer heterogeneous graph (VAHG) module and a dual question-to-answer heterogeneous graph (QAHG) module that interactively refine reasoning paths for semantic agreement. Moreover, our HGL integrates a contextual voting module to exploit long-range visual context for better global reasoning. 
Experiments on the large-scale Visual Commonsense Reasoning benchmark demonstrate the superior performance of our proposed modules on the three tasks (improving accuracy by 5% on Q→A, 3.5% on QA→R and 5.8% on Q→AR)2.

1 Introduction

Vision-and-language tasks have attracted more and more research, including visual question answering (VQA) [29, 35, 36], visual dialogue [15, 10], visual question generation (VQG) [20, 30], visual grounding [17, 11, 41] and vision-language navigation [39, 21]. These tasks can roughly be divided into two types. One type explores a powerful end-to-end network. Devlin et al. [12] introduced a powerful end-to-end network named BERT for learning more discriminative representations of language. Anderson et al. [2] utilized attention mechanisms [42] and presented a bottom-up top-down end-to-end architecture. However, these works resort to powerful end-to-end networks that cannot produce interpretable reasoning paths. The other type explores the intra-relationship of visual objects (a homogeneous graph). Norcliffe-Brown et al. [32] presented a spatial graph and a semantic graph to model object locations and semantic relationships. Tang et al. [37] modeled the intra-relationship of visual objects by composing dynamic tree structures that place the visual objects into a visual context. 
However, all of them solely consider intra-relationships limited to homogeneous graphs, which is not enough for visual commonsense reasoning (VCR) due to its high demand for proper semantic alignment between the vision and linguistic domains. In this paper, we resolve this challenge via heterogeneous graph learning, which seamlessly integrates intra-graph and inter-graph reasoning to bridge the vision and language domains.

*Corresponding author is Xiaodan Liang
2Our code is released at https://github.com/yuweijiang/HGL-pytorch

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: (a) The answer-to-answer homogeneous graph models the intra-relationship of each word in all answers in the linguistic domain; (b) the vision-to-vision homogeneous graph mines the intra-relationship of each object in the image; (c) the vision-to-answer heterogeneous graph excavates the inter-relationship between objects and answers. The dotted portions in (a) and (b) mark information isolated islands. In our paper, an information isolated island refers to an independent semantic node that cannot achieve semantic inference in a homogeneous graph, which connects only similar semantic nodes by attribute (e.g., Figure 1(a)) or grammar (e.g., Figure 1(b)).

Current approaches mainly fall into homogeneous graph modeling within the same domain (e.g., a vision-to-vision graph) to mine intra-relationships. However, one of the keys to this cognition-level problem is to excavate the inter-relationship between vision and linguistics (e.g., a vision-to-answer graph) by aligning semantic nodes across the two domains. Figure 1 shows the difference between homogeneous graphs and heterogeneous graphs. A homogeneous graph may lead to information isolated islands (dotted portions in Figure 1(a)(b)); for example, a vision-to-vision graph is limited to object relationships and is not related to functionality-based information (e.g. 
get, pours into) from linguistics, which hinders semantic inference. For instance, the sentence "person1 pours bottle2 into wineglass2" contains the functionality-based information "pours into", which is different from semantic visual entities such as "person1", "bottle2" and "wineglass2". A homogeneous graph, as shown in Figure 1(a)(b), may not be enough to bridge the vision and language domains. However, we can connect a visual semantic node (e.g., "person1") with a functional node (e.g., "pours into") from linguistics via a heterogeneous graph (Figure 1(c)), which supports proper semantic alignment between the vision and linguistic domains. Benefiting from the merits of the heterogeneous graph, we can seamlessly connect the inter-relationship between vision and linguistics and thereby refine the reasoning path for semantic agreement. Here, we propose to use heterogeneous graph learning for the VCR task to support aligning the visual representation with linguistics.

A heterogeneous graph module, including a primal vision-to-answer heterogeneous graph (VAHG) and a dual question-to-answer heterogeneous graph (QAHG), is the core of Heterogeneous Graph Learning (HGL) and involves two steps: (1) build a heterogeneous graph and evolve it; (2) utilize the evolved graph to guide the answer selection. First, given the generated node representations of vision and linguistics as input, confidence weights are utilized to learn the correlation of each node pair. Then, to locate relevant node relationships in the heterogeneous graph conditioned on the given question and image, we utilize heterogeneous graph reasoning to obtain the evolved heterogeneous graph representation. Finally, given the evolved graph representation, a guidance mechanism is utilized to route the correct output content.

Moreover, there exist some ambiguous semantics (e.g. 
rainy day) that lack specific labels for detection and thus cannot benefit from the labeled object bounding boxes and categories such as "person" and "dog" during training on the VCR task. To solve this problem, our HGL integrates a contextual voting module (CVM) for visual scene understanding with a global perspective on the low-level features.

The key merits of our paper lie in four aspects: a) a framework called HGL is introduced to seamlessly integrate intra-graph and inter-graph reasoning in order to bridge the vision and linguistic domains, consisting of a heterogeneous graph module and a CVM; b) a heterogeneous graph module is proposed, including a primal VAHG and a dual QAHG that collaborate with each other via heterogeneous graph reasoning and a guidance mechanism; c) a CVM is presented to provide a new perspective for global reasoning; d) extensive experiments demonstrate the state-of-the-art performance of our proposed HGL on three cognition-level tasks.

Figure 2: Overview of our HGL framework. Taking the image, question and candidate answers with four-way multiple choices as input, we use HGL to predict the right choice among the candidate answers. We first use a CNN (ResNet50 [16]) tailed with the CVM to obtain the visual representation with global reasoning. Then we utilize the shared BERT [12] to extract the question representation and the candidate answer representations, respectively. 
Then, taking the three representations as the input of the heterogeneous graph module, a primal VAHG module and a dual QAHG module are used to construct heterogeneous graph relationships that align semantics among vision, question and answer via heterogeneous graph reasoning and guidance, outputting two evolved representations. The two representations are fed into a parser and a classifier to produce the final result.

2 Related Work

Visual Comprehension Recently, visual comprehension has made significant progress in many tasks, such as visual question answering [18, 4, 3], visual dialog [15, 37, 9] and visual question generation [25, 20]. There are mainly two methodological directions in the domain of visual comprehension. On the one hand, attention-based approaches have been widely applied with superior performance. Anderson et al. [2] presented a powerful architecture driven by bottom-up and top-down attention for image captioning and visual question answering. For the multi-hop reasoning question answering task, a bi-directional attention mechanism [7] was proposed and combined with an entity graph convolutional network to obtain relation-aware node representations for entity graphs. On the other hand, graph-based approaches have been developing rapidly, combining graph representations of questions and topic images with graph neural networks. Wu et al. [40] incorporated high-level concepts such as external knowledge into the successful CNN-RNN approach for image captioning and visual question answering. A graph-based approach [32] was proposed to model object-level relationships conditioned on the given question and image, including spatial relationships and object semantics. In contrast, our proposed HGL differs in that inter-relationships are built across different domains (e.g. 
vision-to-answer graph) to align the vision and linguistic domains.

Graph Learning Some researchers model domain knowledge as a homogeneous graph to excavate correlations among labels or objects in images, which has proved effective in many tasks [31, 28]. Graph convolution approaches with spectral variants [5] and diffusion approaches [13] are well developed and have been applied to semi-supervised node classification [24]. Some studies utilize the adjacency matrix to model the relationships of all node pairs [38, 6], while others incorporate higher-order structure inspired by simulated random walks [1, 14]. Li et al. [26] solved scene graph generation via a subgraph-based model using bottom-up relationship inference of objects in images. Liang et al. [27] modeled semantic correlations via a neural-symbolic graph to explicitly incorporate the semantic concept hierarchy during propagation. Yang et al. [43] built a prior knowledge-guided graph of body part locations to account for global pose configurations. In this work, we are the first to propose heterogeneous graph learning that seamlessly integrates intra-graph and inter-graph reasoning to generate persuasive reasoning paths and bridge cross-domain semantic alignment.

3 Methodology

Our Heterogeneous Graph Learning (HGL) framework is shown in Figure 2. The HGL consists of a heterogeneous graph module (i.e., a primal VAHG module and a dual QAHG module) and a contextual voting module (CVM). The heterogeneous graph module aligns semantic nodes between the vision and linguistic domains. The goal of the CVM is to exploit long-range visual context for better global reasoning.

Figure 3: Implementation details of the primal VAHG module and the dual QAHG module, taking the representations of image, question and answer as inputs.

3.1 Definition

Given the object or region set O = {o_i}_{i=1}^N of an input image I, the word set Q = {q_i}_{i=1}^M of a query and the word set A = {a_i}_{i=1}^B of the candidate answers, we seek to construct the heterogeneous graph node sets V^o = {v^o_i}_{i=1}^N, V^q = {v^q_i}_{i=1}^M and V^a = {v^a_i}_{i=1}^B, correspondingly. Each node v^o_i ∈ V^o corresponds to a visual object o_i ∈ O, and the associated feature vector with d dimensions is denoted v^o_i ∈ R^d. Similarly, the associated query word vector and answer word vector can separately be formulated as v^q_i ∈ R^d and v^a_i ∈ R^d. By concatenating the joint embeddings v_i together into a matrix X, we separately define three matrices: the vision matrix X^o ∈ R^{N×d}, the query matrix X^q ∈ R^{M×d} and the answer matrix X^a ∈ R^{B×d}, where N, M and B denote the visual object number, query word number and answer word number, respectively. 
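As a concrete illustration of these definitions, the per-node feature vectors can be stacked into the vision, query and answer matrices. The sizes and embedding values below are toy assumptions, not from the paper:

```python
import numpy as np

# Toy sizes: N visual objects, M query words, B answer words,
# all embedded in a shared d-dimensional space (values are illustrative).
N, M, B, d = 3, 5, 6, 4
V_o = [np.full(d, float(i)) for i in range(N)]  # visual object nodes v^o_i
V_q = [np.full(d, float(i)) for i in range(M)]  # query word nodes   v^q_i
V_a = [np.full(d, float(i)) for i in range(B)]  # answer word nodes  v^a_i

# Stack the per-node vectors into the three matrices of Section 3.1.
X_o, X_q, X_a = (np.stack(v) for v in (V_o, V_q, V_a))
print(X_o.shape, X_q.shape, X_a.shape)  # (3, 4) (5, 4) (6, 4)
```

In practice the visual features come from RoI-aligned CNN activations and the word features from BERT, but the resulting shapes are the same.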
We define the final output of our framework as Y^p ∈ R^4, a vector with 4 dimensions according to the four-way multiple choice of candidate answers, as shown in Figure 2.

3.2 Heterogeneous Graph Learning

3.2.1 Vision-to-answer heterogeneous graph (VAHG) module

This module aims to align the semantics between the vision and candidate answer domains via heterogeneous graph reasoning, and then to generate a vision-to-answer guided representation Y^v ∈ R^{B×d} for classification via a guidance mechanism (Figure 3). We first introduce the reasoning of the vision-to-answer heterogeneous graph, then the guidance mechanism. We define the vision-to-answer heterogeneous reasoning as:

Y^o = δ(A^T X^o W^o),   (1)

where Y^o ∈ R^{B×d} is the visual evolved representation, W^o ∈ R^{d×d} is a trainable weight matrix and δ is a non-linear function. A ∈ R^{N×B} is a heterogeneous adjacency weight matrix obtained by calculating the accumulative weights of answer nodes to vision nodes, which is formulated as:

A_ij = exp(A'_ij) / Σ_{ij} exp(A'_ij),   A'_ij = v^oT_i v^a_j,   (2)

where A_ij ∈ A is a scalar indicating the correlation between v^o_i and v^a_j. The heterogeneous adjacency weight matrix A ∈ R^{N×B} is normalized by a softmax. In this way, different heterogeneous node representations can adaptively propagate to each other.

We obtain the visual evolved representation Y^o via the vision-to-answer heterogeneous graph reasoning of Equation (1). Given Y^o and the answer matrix X^a ∈ R^{B×d} as input, we present a guidance mechanism for producing the vision-to-answer guided representation Y^v. We divide the guidance mechanism into two steps. 
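Before detailing the two steps, the heterogeneous reasoning of Equations (1) and (2) can be sketched in NumPy. The shapes are toy values and the choice of ReLU for the non-linearity δ is an assumption; this is a minimal illustration, not the paper's implementation:

```python
import numpy as np

def vahg_reasoning(X_o, X_a, W_o):
    """Sketch of Equations (1)-(2): vision-to-answer heterogeneous reasoning.

    X_o: (N, d) vision matrix; X_a: (B, d) answer matrix;
    W_o: (d, d) stand-in for the trainable weight matrix.
    """
    A_raw = X_o @ X_a.T                      # A'_ij = v^o_i . v^a_j, shape (N, B)
    A_exp = np.exp(A_raw - A_raw.max())      # numerically stabilised exponentials
    A = A_exp / A_exp.sum()                  # normalise over all (i, j) as in Eq. (2)
    Y_o = np.maximum(A.T @ X_o @ W_o, 0.0)   # Eq. (1) with delta = ReLU (assumed)
    return A, Y_o

rng = np.random.default_rng(0)
A, Y_o = vahg_reasoning(rng.normal(size=(4, 8)),   # N = 4 visual nodes
                        rng.normal(size=(6, 8)),   # B = 6 answer nodes
                        np.eye(8))                 # identity stand-in for W^o
print(A.shape, Y_o.shape)  # (4, 6) (6, 8)
```

The evolved representation Y^o thus lives in the answer space (B rows), which is what lets the guidance mechanism below combine it directly with the answer matrix.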
We first generate a middle representation X^middle that is enhanced by word-level attention values, then we apply a final process to generate the target representation Y^v. The first step is formulated as follows:

X^a' = F(X^a),   (3)

x^m = a_n x^a',   a_n = exp(X^a' W^a') / Σ_{n∈B} exp(X^a' W^a'),   (4)

X^middle = f([X^m, Y^o]),   (5)

where F is an MLP used to encode the answer feature and a_n is a word-level attention value computed by the weighted product W^a' on the encoded answer feature X^a' with its B words. These attention values are normalized over all word-level linguistic representations via a softmax operation, as shown in Equation (4). Then we apply the attention value a_n to x^a' ∈ X^a' to produce an attention vector x^m ∈ X^m. We concatenate the vectors x^m together into a matrix X^m that is enhanced by the attention values. After that, to combine the relationship between X^m and Y^o with Y^o, instead of simply combining Y^o with X^m, we concatenate X^m with Y^o and then apply an MLP f to the concatenated result to obtain a middle representation X^middle ∈ R^{B×d} for generating our final vision-to-answer guided representation Y^v.

In the second step, X^middle and the visual evolved representation Y^o are used to produce the vision-to-answer guided representation Y^v, which is formulated as:

Y^v = Ψ(φ(Y^o W^o' + X^middle W^a) W),   (6)

where W^o' and W^a are both learnable mapping matrices that map the different embedding features into a common space to better combine Y^o and X^middle, and the senior weight matrix W maps the combination result into the same dimensionality. Ψ and φ are both vision-guided functions, such as MLPs, used to obtain the final vision-to-answer guided representation Y^v.

3.2.2 Question-to-answer heterogeneous graph (QAHG) module

In this section, we introduce a question-to-answer heterogeneous graph module that is similar to VAHG. The implementation details of the QAHG module are shown in Figure 3. This module aims to support the proper semantic alignment between the question and answer domains. Given the query word vectors and answer word vectors as input, we aim to generate the question-to-answer guided representation Y^q ∈ R^{B×d} as the final output of this module. Specifically, taking the answer vectors as input, we utilize question-to-answer heterogeneous graph reasoning to produce a query evolved representation X^q ∈ R^{B×d}. Then a symmetric guidance mechanism is utilized to generate the question-to-answer guided representation Y^q.

After obtaining Y^v and Y^q from the VAHG module and the QAHG module, respectively, we utilize a parser at the end of the HGL framework, as shown in Figure 2, to adaptively merge Y^v and Y^q into an enhanced representation Y^a for final classification. The parser can be formulated as:

Y^a = F(w^o Y^v + w^q Y^q),   (7)

where w^o and w^q are derived from the original visual feature, query feature and answer feature to calculate the importance of each branch. We use a simple dot product to merge the two representations (Y^v and Y^q). Then we use a linear mapping function F, such as an FC layer, to produce the enhanced representation Y^a for final classification. The weights w^o and w^q are calculated as:

w^o = exp(ϕ_v([X^o, X^a] W^oa)) / (exp(ϕ_v([X^o, X^a] W^oa)) + exp(ϕ_q([X^q, X^a] W^qa))),   (8)

w^q = exp(ϕ_q([X^q, X^a] W^qa)) / (exp(ϕ_v([X^o, X^a] W^oa)) + exp(ϕ_q([X^q, X^a] W^qa))),   (9)

where W^oa and W^qa are both trainable weight matrices, ϕ_v and ϕ_q indicate different MLP networks, and [·] 
means the concatenation operation.

3.2.3 Contextual voting module

This module aims to replenish local visual objects with relevant scene context, giving the objects a global perspective via a voting mechanism, which is formulated as:

y^l_i = (1 / C(x)) Σ_{∀j} f(x^l_i, x^l_j) g(x^l_j),   (10)

a_{x_j→x_i} = exp(W^aT_n φ(x^l_i, x^l_j)) / Σ_{n∈N} exp(W^aT_n φ(x^l_i, x^l_j)),   (11)

y^{l+1}_i = a_{x_j→x_i} y^l_i W^a + x^l_i,   (12)

where y^{l+1}_i and x^l_i denote the output and the input at position i of the l-th convolution layer, and x^l_j denotes the input at position j in the relevant image content. Equation (10) represents that each output collects global input information from relevant positions. In Equation (11), a_{x_j→x_i} denotes the learnable voting weight from position x_j to x_i, which adaptively selects relevant contextual information into the local visual feature via a weighted sum and element-wise product. W^a and W^a_n are both trainable weights, and φ, f, g are non-linear functions implemented with 1×1 convolutions. The output of this module is y^{l+1}_i in Equation (12), the residual visual feature map obtained via residual addition between the input x^l_i and the enhanced feature a_{x_j→x_i} y^l_i W^a.

4 Experiments

4.1 Task Setup

The visual commonsense reasoning (VCR) task [44] is a new cognition-level reasoning task consisting of three sub-tasks: Q→A, QA→R and Q→AR. 
VCR is a four-way multiple-choice task: the model must choose the correct answer from four given answer choices for a question, and then select the right rationale from four given rationale choices for that question and answer.

4.2 Dataset and Evaluation

We carry out extensive experiments on the VCR [44] benchmark, a representative large-scale visual commonsense reasoning dataset containing a total of 290k multiple-choice QA problems derived from 110k movie scenes. The dataset is officially split into a training set consisting of 80,418 images with 212,923 questions, a validation set containing 9,929 images with 26,534 questions and a test set made up of 9,557 images with 25,263 questions. We follow this data partition in all experiments. The dataset is challenging because of its complex and diverse language, multiple scenes and hard inference types, as mentioned in [44]. Of note, unlike many VQA datasets wherein the answer is a single word, the average answer length is more than 7.5 words and the average rationale length is more than 16 words. We strictly follow the data preprocessing and evaluation of [44] for fair comparison.

4.3 Implementation

We conduct all experiments using 8 GeForce GTX TITAN XP cards on a single server. The batch size is set to 96, with 12 images on each GPU. We strictly follow the baseline [44] in utilizing ResNet-50 [16] and BERT [12] as our backbone, and implement our proposed heterogeneous graph learning on top of it in PyTorch [33]. The hyper-parameters in training mostly follow R2C [44]. We train our model using multi-class cross entropy between the prediction and the label. Each task is trained separately for question answering and answer reasoning via the same network. For all training, Adam [23] with weight decay 0.0001 and beta 0.9 is adopted to optimize all models. The initial learning rate is 0.0002, halved (×0.5) after two epochs in which the validation accuracy does not increase. 
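The learning-rate schedule just described (halve the rate once validation accuracy has stalled for two epochs) can be sketched as a small helper. This is a simplified stand-in for the actual training loop, and the accuracy values in the example are illustrative:

```python
def plateau_halving(val_accs, init_lr=2e-4, patience=2, factor=0.5):
    """Return the learning rate used at each epoch, halving it once the
    validation accuracy has not improved for `patience` consecutive epochs."""
    lr, best, stale, lrs = init_lr, float("-inf"), 0, []
    for acc in val_accs:
        lrs.append(lr)
        if acc > best:
            best, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor
                stale = 0
    return lrs

# Five epochs with no improvement after the first: the rate halves once
# two stale epochs have accumulated.
print(plateau_halving([0.60, 0.60, 0.60, 0.60, 0.60]))
```

In PyTorch the same behaviour is typically obtained with a reduce-on-plateau style scheduler rather than a hand-rolled loop.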
We train 20 epochs for all models from scratch in an end-to-end manner. Unless otherwise noted, settings are the same for all experiments.

4.4 Comparison with state-of-the-art

Quantitative results. In this section, we report our state-of-the-art validation and test results on the VCR [44] dataset for the three tasks in Table 1. Note that the labels of the test set are not available,

Model                     | Q→A Val | Q→A Test | QA→R Val | QA→R Test | Q→AR Val | Q→AR Test
Chance                    |  25.0   |  25.0    |  25.0    |  25.0     |   6.2    |   6.2
BERT [12]                 |  53.8   |  53.9    |  64.1    |  64.5     |  34.8    |  35.0
BERT (response only) [44] |  27.6   |  27.7    |  26.3    |  26.2     |   7.6    |   7.3
ESIM+ELMo [8]             |  45.8   |  45.9    |  55.0    |  55.1     |  25.3    |  25.6
LSTM+ELMo [34]            |  28.1   |  28.3    |  28.7    |  28.5     |   8.3    |   8.4
RevisitedVQA [19]         |  39.4   |  40.5    |  34.0    |  33.7     |  13.5    |  13.8
BottomUpTopDown [2]       |  42.8   |  44.1    |  25.1    |  25.1     |  10.7    |  11.0
MLB [22]                  |  45.5   |  46.2    |  36.1    |  36.8     |  17.0    |  17.2
MUTAN [4]                 |  44.4   |  45.5    |  32.0    |  32.2     |  14.6    |  14.6
R2C [44]                  |  63.8   |  65.1    |  67.2    |  67.3     |  43.1    |  44.0
HGL (Ours)                |  69.4   |  70.1    |  70.6    |  70.8     |  49.1    |  49.8
Human                     |   -     |  85.0    |   -      |  93.0     |   -      |  91.0

Table 1: Main results on the VCR validation and test sets for the three tasks. BERT through LSTM+ELMo are text-only baselines; RevisitedVQA through MUTAN are VQA models. Note that we do not need any extra information such as additional data or features.

and we obtain the test predictions by submitting our results to the public leaderboard [44]. As can be seen, our HGL achieves an overall test accuracy of 70.1% (vs. 65.1% for R2C [44]) on the Q→A task, 70.8% (vs. 67.3%) on the QA→R task, and 49.8% (vs. 44.0%) on the Q→AR task. Compared with the state-of-the-art text-only methods on the Q→A task, our HGL improves test accuracy by 16.2% over BERT [12] and even outperforms ESIM+ELMo [8] by 24.2%. Compared with several advanced VQA methods, our model improves test accuracy by at least 23.9% on the three tasks. 
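The headline gains quoted in the abstract can be read directly off the test columns of Table 1 by differencing HGL against R2C:

```python
# Test accuracies (%) from Table 1, as (R2C, HGL) pairs per task.
results = {"Q->A": (65.1, 70.1), "QA->R": (67.3, 70.8), "Q->AR": (44.0, 49.8)}

gains = {task: round(hgl - r2c, 1) for task, (r2c, hgl) in results.items()}
print(gains)  # {'Q->A': 5.0, 'QA->R': 3.5, 'Q->AR': 5.8}
```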
The superior performance further demonstrates the effectiveness of our model on the cognition-level task.

Qualitative results. Figure 4 shows qualitative results of our HGL, together with the learned primal vision-to-answer heterogeneous graph (VAHG) and the learned dual question-to-answer heterogeneous graph (QAHG), further demonstrating the interpretability of our HGL. As shown in Figure 4, we apply HGL to the four candidate responses to support proper semantic alignment between the vision and linguistic domains. For a comprehensive analysis, we show the weighted connections of the VAHG and QAHG according to our correct predictions on the different tasks. In Figure 4(d), our VAHG successfully aligns the visual representation "person5 (brown box)" to the linguistic word "witness", and also connects "person1 (red box)" with "person5" through the linguistic word "witness" to infer the right answer, since "person5 (brown box)" is the "witness" in this scenario. The visual representation "person1 (red box)" is connected with the emotional word "angry" to form a heterogeneous relationship. Based on the right answer, in Figure 4(g), our QAHG can connect the word "angry" with "gritting his teeth" to successfully reason about the rationale. Moreover, in Figure 4(c), "feeling" from the question is aligned with the most suitable semantic word "anger" from the answer for right answer prediction, which demonstrates the effectiveness of our QAHG. These results demonstrate that the VAHG and QAHG achieve proper semantic alignment between the vision and linguistic domains to support cognition-level reasoning. The CVM is well suited to the visual context because more information can be obtained from the visual evidence. For instance, in the example of Figure 5, the feeling of "person2" must be inferred from visual evidence (e.g. 
raindrop) rather than from the question to predict the right answer and rationale. The arrows point at the raindrops and snow. More results are shown in the supplementary material.

4.5 Ablation Studies

The effect of CVM. Note that our CVM learns an enhanced object feature with a global perspective. The advantage of CVM is shown in Table 2: the CVM increases overall validation accuracy over the baseline by 1.8% on the Q→A task, 1.2% on QA→R and 2.3% on Q→AR. In Figure 5, the model w/ CVM shows a superior ability to parse the semantics of rain and snow in the image, better understanding the rainy/snowy scene by highlighting the relevant context, as indicated by the red arrows in Figure 5(b).

Model            | Q→A  | QA→R | Q→AR
Baseline         | 63.8 | 67.2 | 43.1
Baseline w/ CVM  | 65.6 | 68.4 | 45.4
Baseline w/ QAHG | 66.1 | 68.2 | 45.8
Baseline w/ VAHG | 66.4 | 69.1 | 46.4
HGL w/o CVM      | 68.4 | 69.7 | 48.3
HGL w/o QAHG     | 67.8 | 69.9 | 48.2
HGL w/o VAHG     | 68.0 | 68.8 | 48.0
HGL              | 69.4 | 70.6 | 49.1

Table 2: Ablation studies for our HGL on the three tasks over the validation set.

Figure 4: Qualitative results of VAHG and QAHG. (a)(b)(e)(f) are the learned VAHG and QAHG of the four-way multiple choices on the answer task and the reason task, respectively. (c)(d)(g)(h) show the weighted connections of VAHG and QAHG according to the prediction among the four choices. The predicted result is shown in bold font, and the ground truth (GT) is marked with a check. Please zoom into the colored PDF version of this paper for more details.

The effect of QAHG. The QAHG module boosts validation accuracy by around 1.0% over the baseline on all tasks. 
In Figure 4(c), "feeling" from the question is connected with "getting angry" from the right answer choice, and in Figure 4(g) "getting angry" from the question is connected with "gritting his teeth" from the rationale, which demonstrates that QAHG can generate persuasive reasoning paths.

The effect of VAHG. We analyze the effect of VAHG on VCR. The VAHG module improves the baseline accuracy by 2.6% (Q→A), 1.9% (QA→R) and 3.3% (Q→AR). In Figure 4(h), the visual representation "person1 (red box)" is semantically aligned to the word "person1". The visual representation "person5 (brown box)" is semantically aligned to the word "witness" in Figure 4(d), and "person1 (red box)" and "person5 (brown box)" are connected by the word "witness". Based on these relationships, the HGL can refine reasoning paths for semantic agreement.

HGL w/o CVM. As can be seen in Table 2, the combination of the VAHG module and the QAHG module reaches a high performance of 68.4%, 69.7% and 48.3%, demonstrating the effectiveness of building VAHG and QAHG to bridge the vision and language domains.

HGL w/o QAHG. The proposed CVM collaborating with the VAHG module is evaluated on the three tasks and achieves 67.8%, 69.9% and 48.2%, correspondingly. This supports the feasibility of incorporating the CVM with the VAHG module.

HGL w/o VAHG. The performance of CVM+QAHG is shown in Table 2; it reaches 68.0%, 68.8% and 48.0% on the three tasks, validating the availability of this combination.
c) [person1] is feeling tired from the trip.
d) [person1] is getting angry at the witness. ✓
R: d) is right because...
a) [person1]'s glaring eyes and the tight set of his jaw and mouth suggest anger.
b) This is a courtroom and [person3] is probably a lawyer. He is looking towards the middle and not the side, which means he is probably talking to the judge and not the witness.
c) [person1] has an angry look on his face, and is moving his mouth in a way that looks like he is shouting; this look is typical of one who is angry at another and is verbally challenging them.
d) [person1] is gritting his teeth. [person1] has a look of pure anger on his face. ✓

Figure 5: Qualitative results of our CVM. (a) The model w/o CVM. (b) The model w/ CVM. [Figure examples] Q: What if [person2] fell? A: [person2] would get wet. R: [person2] is surrounded by water. Q: Is it snowing outside? A: Yes, it is snowing. R: [person4] is dressed in a hat, scarf and a big jacket; his hat and shoulders are covered in white snowflakes.

5 Conclusion

In this paper, we proposed a novel heterogeneous graph learning framework called HGL for seamlessly integrating intra-graph and inter-graph reasoning in order to achieve proper semantic alignment between the vision and linguistic domains. HGL contains a dual heterogeneous graph module consisting of a vision-to-answer heterogeneous graph module and a question-to-answer heterogeneous graph module. Furthermore, HGL integrates a contextual voting module to exploit long-range visual context for better global reasoning.

6 Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant No.1811461, in part by the NSFC under Grant No.61976233, and in part by the Natural Science Foundation of Guangdong Province, China under Grant No.2018B030312002.