{"title": "Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 6850, "page_last": 6860, "abstract": "In vision-and-language grounding problems, fine-grained representations of the image are considered to be of paramount importance. Most of the current systems incorporate visual features and textual concepts as a sketch of an image. However, plainly inferred representations are usually undesirable in that they are composed of separate components, the relations of which are elusive. In this work, we aim at representing an image with a set of integrated visual regions and corresponding textual concepts, reflecting certain semantics. To this end, we build the Mutual Iterative Attention (MIA) module, which integrates correlated visual features and textual concepts, respectively, by aligning the two modalities. We evaluate the proposed approach on two representative vision-and-language grounding tasks, i.e., image captioning and visual question answering. In both tasks, the semantic-grounded image representations consistently boost the performance of the baseline models under all metrics across the board. The results demonstrate that our approach is effective and generalizes well to a wide range of models for image-related applications. 
(The code is available at \\url{https://github.com/fenglinliu98/MIA})", "full_text": "Aligning Visual Regions and Textual Concepts for\n\nSemantic-Grounded Image Representations\n\nFenglin Liu1\u2217, Yuanxin Liu3,4\u2217, Xuancheng Ren2\u2217, Xiaodong He5, Xu Sun2\n\n1ADSPLAB, School of ECE, Peking University, Shenzhen, China\n\n2MOE Key Laboratory of Computational Linguistics, School of EECS, Peking University\n\n3Institute of Information Engineering, Chinese Academy of Sciences\n4School of Cyber Security, University of Chinese Academy of Sciences\n\n{fenglinliu98, renxc, xusun}@pku.edu.cn, liuyuanxin@iie.ac.cn\n\n5JD AI Research\n\nxiaodong.he@jd.com\n\nAbstract\n\nIn vision-and-language grounding problems, fine-grained representations of the\nimage are considered to be of paramount importance. Most of the current systems\nincorporate visual features and textual concepts as a sketch of an image. However,\nplainly inferred representations are usually undesirable in that they are composed\nof separate components, the relations of which are elusive. In this work, we aim at\nrepresenting an image with a set of integrated visual regions and corresponding\ntextual concepts, reflecting certain semantics. To this end, we build the Mutual\nIterative Attention (MIA) module, which integrates correlated visual features and\ntextual concepts, respectively, by aligning the two modalities. We evaluate the\nproposed approach on two representative vision-and-language grounding tasks,\ni.e., image captioning and visual question answering. In both tasks, the semantic-grounded\nimage representations consistently boost the performance of the baseline\nmodels under all metrics across the board. 
The results demonstrate that our ap-\nproach is effective and generalizes well to a wide range of models for image-related\napplications.2\n\n1\n\nIntroduction\n\nRecently, there is a surge of research interest in multidisciplinary tasks such as image captioning\n[7] and visual question answering (VQA) [3], trying to explain the interaction between vision and\nlanguage. In image captioning, an intelligence system takes an image as input and generates a\ndescription in the form of natural language. VQA is a more challenging problem that takes an extra\nquestion into account and requires the model to give an answer depending on both the image and\nthe question. Despite their different application scenarios, a shared goal is to understand the image,\nwhich necessitates the acquisition of grounded image representations.\nIn the literature, an image is typically represented in two fundamental forms: visual features and\ntextual concepts (see Figure 1). Visual Features [30, 2, 18] represent an image in the vision domain\nand contain abundant visual information. For CNN-based visual features, an image is split into\nequally-sized visual regions without encoding global relationships such as position and adjacency. To\nobtain better image representations with respect to concrete objects, RCNN-based visual features that\nare de\ufb01ned by bounding boxes of interests are proposed. Nevertheless, the visual features are based\n\n\u2217Equal contribution.\n2The code is available at https://github.com/fenglinliu98/MIA\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: Illustrations of commonly-used image representations (from left to right): CNN-based grid\nvisual features, RCNN-based region visual features, textual concepts, and scene-graphs.\n\non regions and are not associated with the actual words, which means the semantic inconsistency\nbetween the two domains has to be resolved by the downstream systems themselves. 
Textual\nConcepts [8, 35, 31] represent an image in the language domain and introduce semantic information.\nThey consist of unordered visual words, irrespective of af\ufb01liation and positional relations, making it\ndif\ufb01cult for the system to infer the underlying semantic and spatial relationships. Moreover, due to\nthe lack of visual reference, some concepts may induce semantic ambiguity, e.g., the word mouse can\neither refer to a mammal or an electronic device. Scene-Graphs [34] are the combination of the two\nkinds of representations. They use region-based visual features to represent the objects and textual\nconcepts to represent the relationships. However, to construct a scene-graph, a complicated pipeline\nis required and error propagation cannot be avoided.\nFor image representations used for text-oriented purposes, it is often desirable to integrate the two\nforms of image information. Existing downstream systems achieve that by using both kinds of image\nrepresentations in the decoding process, mostly ignoring the innate alignment between the modalities.\nAs the semantics of the visual features and the textual concepts are usually inconsistent, the systems\nhave to devote themselves to learn such alignment. Besides, these representations only contain local\nfeatures, lacking global structural information. Those problems make it hard for the systems to\nunderstand the image ef\ufb01ciently.\nIn this paper, we work toward constructing integrated image representations from vision and language\nin the encoding process. The objective is achieved by the proposed Mutual Iterative Attention (MIA)\nmodule, which aligns the visual features and textual concepts with their relevant counterparts in\neach domain. The motivation comes from the fact that correlated features in one domain can be\nlinked up by a feature in another domain, which has connections with all of them. 
In implementation,\nwe perform mutual attention iteratively between the two domains to realize the procedure without\nannotated alignment data. The visual receptive fields gradually concentrate on salient visual regions,\nand the original word-level concepts are gradually merged to recapitulate corresponding visual\nregions. In addition, the aligned visual features and textual concepts provide a clearer definition\nof the image aspects they represent.\nThe contributions of this paper are as follows:\n\n\u2022 For vision-and-language grounding problems, we introduce integrated image representations\nbased on the alignment between visual regions and textual concepts to describe the salient\ncombination of local features in a certain modality.\n\n\u2022 We propose a novel attention-based strategy, namely the Mutual Iterative Attention (MIA),\nwhich uses the features from the other domain as the guide for integrating the features in the\ncurrent domain without mixing in the heterogeneous information.\n\n\u2022 According to the extensive experiments on the MSCOCO image captioning dataset and VQA\nv2.0 dataset, when equipped with the MIA, improvements on the baselines are witnessed in all\nmetrics. This demonstrates that the semantic-grounded image representations are effective and\ncan generalize to a wide range of models.\n\n2 Approach\n\nThe proposed approach acts on plainly extracted image features from vision and language, e.g.,\nconvolutional feature maps, regions of interest (RoI), and visual words (textual concepts), and refines\nthose features so that they can describe visual semantics, i.e., meaningful compositions of such\nfeatures, which are then used in the downstream tasks to replace the original features. 
Figure 2 gives\nan overview and an example of our approach.\n\n2.1 Visual Features and Textual Concepts\n\nVisual features and textual concepts are widely used\n[35, 33, 15, 21] as the information sources for image-\ngrounded text generation. In common practice, visual fea-\ntures are extracted by ResNet [10], GoogLeNet [27] and\nVGG [25], and are rich in low-level visual information [31].\nRecently, more and more work adopted regions of interest\n(RoI) proposed by RCNN-like models as visual features,\nand each RoI is supposed to contain a speci\ufb01c object in\nthe image. Textual concepts are introduced to compensate\nthe lack of high-level semantic information in visual fea-\ntures [8, 31, 35]. Speci\ufb01cally, they consist of visual words\nthat can be objects (e.g., dog, bike), attributes (e.g., young,\nblack) and relations (e.g., sitting, holding). The embedding\nvectors of these visual words are then taken as the textual\nconcepts. It is worth noticing that to obtain visual features\nand textual concepts, only the image itself is needed as\ninput, and no external text information about the image is\nrequired, meaning that they can be used for any vision-and-\nlanguage grounding problems. In the following, we denote\nthe visual features and textual concepts for an image as I\nand T , respectively.\n\n2.2 Learning Alignment\n\nTo form the alignment between the visual regions and the\ntextual words, we adopt the attention mechanism from\nVaswani et al. 
[28], which is designed initially to obtain\ncontextual representations for sentences in machine translation\nand has proven to be effective in capturing alignment\nof different languages and structure of sentences.\n\nFigure 2: Overview of our approach.\nWe take as input visual features and textual\nconcepts (the lower) and repeat a\nmutual attention mechanism (the middle)\nto combine the local features from\neach domain, resulting in integrated image\nrepresentations reflecting certain semantics\nof the image (the upper).\n\n2.2.1 Mutual Attention\n\nMutual Attention contains two sub-layers. The first sub-layer makes use of multi-head attention to\nlearn the correlated features in a certain domain by querying the other domain. The second sub-layer\nuses a feed-forward layer to add sufficient expressive power.\nThe multi-head attention is composed of k parallel heads. Each head is formulated as a scaled\ndot-product attention:\n\n$$\\mathrm{Att}_i(Q, S) = \\mathrm{softmax}\\left(\\frac{Q W_i^Q (S W_i^K)^T}{\\sqrt{d_k}}\\right) S W_i^V, \\quad i = 1, \\ldots, k \\qquad (1)$$\n\nwhere $Q \\in \\mathbb{R}^{m \\times d_h}$ and $S \\in \\mathbb{R}^{n \\times d_h}$ stand for m querying features and n source features, respectively;\n$W_i^Q, W_i^K, W_i^V \\in \\mathbb{R}^{d_h \\times d_k}$ are learnable parameters of linear transformations; $d_h$ is the size of the\ninput features and $d_k = d_h / k$ is the size of the output features for each attention head. Results from\neach head are concatenated and passed through a linear transformation to construct the output:\n\n$$\\mathrm{MultiHeadAtt}(Q, S) = [\\mathrm{Att}_1(Q, S), \\ldots, \\mathrm{Att}_k(Q, S)] W^O \\qquad (2)$$\n\nwhere $W^O \\in \\mathbb{R}^{d_h \\times d_h}$ is the parameter to be learned. The multi-head attention integrates n source\nfeatures into m output features in the order of querying features. 
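Eqs. (1) and (2) can be illustrated with a minimal pure-Python sketch. It is not the released implementation: the helper names (matmul, softmax_rows, attention_head, multi_head_att, rand_matrix), the toy sizes and the random weights are assumptions made here for illustration only.

```python
import math
import random

def matmul(a, b):
    # Plain-Python product of an (r x c) matrix and a (c x p) matrix.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_head(q, s, wq, wk, wv):
    # Eq. (1): softmax((Q W^Q)(S W^K)^T / sqrt(d_k)) (S W^V)
    qp, kp, vp = matmul(q, wq), matmul(s, wk), matmul(s, wv)
    d_k = len(wq[0])
    scores = matmul(qp, [list(c) for c in zip(*kp)])
    scores = [[v / math.sqrt(d_k) for v in row] for row in scores]
    return matmul(softmax_rows(scores), vp)

def multi_head_att(q, s, heads, wo):
    # Eq. (2): concatenate the k head outputs column-wise, then project with W^O.
    outs = [attention_head(q, s, *h) for h in heads]
    concat = [sum((o[r] for o in outs), []) for r in range(len(q))]
    return matmul(concat, wo)

def rand_matrix(rows, cols, rng):
    # Hypothetical stand-in for learned parameters or extracted features.
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_h, k, n = 8, 2, 3            # toy sizes; the experiments use k = 8 heads
d_k = d_h // k
heads = [tuple(rand_matrix(d_h, d_k, rng) for _ in range(3)) for _ in range(k)]
wo = rand_matrix(d_h, d_h, rng)
T = rand_matrix(n, d_h, rng)   # querying features (textual concepts)
I = rand_matrix(n, d_h, rng)   # source features (visual features)
out = multi_head_att(T, I, heads, wo)   # n integrated visual features of size d_h
```

Because the values come only from the source S, the output stays in the source domain, while the query Q only determines the attention weights and the order of the output rows.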
To simplify computation, we keep\nm the same as n.\n\nFollowing the multi-head attention is a fully-connected network, defined as:\n\n$$\\mathrm{FCN}(X) = \\max\\left(0, X W^{(1)} + b^{(1)}\\right) W^{(2)} + b^{(2)} \\qquad (3)$$\n\nwhere $W^{(1)}$ and $W^{(2)}$ are matrices for linear transformation; $b^{(1)}$ and $b^{(2)}$ are the bias terms. Each\nsub-layer is followed by an operation sequence of dropout [26], shortcut connection [10] (see footnote 3), and layer\nnormalization [4].\nFinally, the mutual attention is conducted as:\n\n$$I' = \\mathrm{FCN}(\\mathrm{MultiHeadAtt}(T, I)), \\quad T' = \\mathrm{FCN}(\\mathrm{MultiHeadAtt}(I', T)) \\qquad (4)$$\n\ni.e., visual features are first integrated according to textual concepts, and then textual concepts are\nintegrated according to integrated visual features. It is worth noticing that it is also possible to reverse\nthe order by first constructing correlated textual concepts. However, in our preliminary experiments,\nwe found that the presented order performs better. The related results and explanations are given in\nthe supplementary materials for reference.\nThe knowledge from either domain can serve as the guide for combining local features and extracting\nstructural relationships of the other domain. 
For example, as shown by the upper left instance of\nthe four instances in Figure 2, the textual concept woman integrates the regions that include the\nwoman, which then draw in textual concepts sitting, girl, shirt, young. In addition, the mutual\nattention aligns the two kinds of features, because for the same position in the two feature matrices,\nthe integrated visual feature and the integrated textual concept are co-referential and represent the\nsame high-level visual semantics. This approach also ensures that the refined visual features only\ncontain homogeneous information because the information from the other domain only serves as the\nattentive weight and is not part of the final values.\n\n2.2.2 Mutual Iterative Attention\n\nTo refine both the visual features and the textual concepts, we propose to perform mutual attention\niteratively. The process in Eq. (4) that uses the original features is considered as the first round:\n\n$$I_1 = \\mathrm{FCN}(\\mathrm{MultiHeadAtt}(T_0, I_0)), \\quad T_1 = \\mathrm{FCN}(\\mathrm{MultiHeadAtt}(I_1, T_0)) \\qquad (5)$$\n\nwhere $I_0$, $T_0$, $I_1$ and $T_1$ represent the original visual features, the original textual concepts, the macro\nvisual features, and the macro textual concepts, respectively. By repeating the same process N\ntimes, we obtain the final outputs of the two stacks:\n\n$$I_N = \\mathrm{FCN}(\\mathrm{MultiHeadAtt}(T_{N-1}, I_{N-1})), \\quad T_N = \\mathrm{FCN}(\\mathrm{MultiHeadAtt}(I_N, T_{N-1})) \\qquad (6)$$\n\nIt is important to note that in each iteration, the parameters of the mutual attention are shared. However,\nas in each iteration more information is integrated into each feature, it is possible that iterating too\nmany times would cause an over-smoothing problem, in which all features come to represent essentially the same\noverall semantics of the image. To avoid such a problem, we apply the aforementioned post-processing\noperations to the output of each layer, but with the shortcut connection from the input of\neach layer (not the sub-layer). 
The shortcut serves as a semantic anchor that prevents the peripheral\ninformation from extending the pivotal visual or textual features too much and keeps the position of\neach semantic-grounded feature stable in the feature matrices.\nFor the downstream tasks consuming both visual features and textual concepts of images, $I_N$ and\n$T_N$ can be directly used to replace the original features, respectively, because the number and the\nsize of the features are kept through the procedure. However, since the visual features and the textual\nconcepts are already aligned, we can directly add them up to get the output that makes the best\nof their respective advantages, even for the tasks that originally only consume one kind of image\nrepresentation:\n\n$$\\mathrm{MIA}(I, T) = \\mathrm{LayerNorm}(I_N + T_N) \\qquad (7)$$\n\nAs a result, the refined features overcome the aforementioned weaknesses of existing image representations,\nproviding a better starting point for downstream tasks. For tasks using both kinds of features, each\nkind of feature can be replaced with MIA-refined features.\n\nFootnote 3: We build the shortcut connection by adding the source features to the sub-layer outputs, instead of the\nquerying features in Vaswani et al. [28], to ensure no heterogeneous information is injected.\n\nFigure 3: Illustration of how to equip the baseline models with our MIA. MIA aligns and integrates\nthe original image representations from two modalities. Left: For image captioning, the semantic-grounded\nimage representations are used to replace both kinds of original image features. Right: For\nVQA, MIA only substitutes the image representations, and the question representations are preserved.\n\nAs annotated alignment data is not easy to obtain and the alignment learning lacks direct supervision,\nwe adopt the distantly-supervised learning and refine the integrated image representations with\ndownstream tasks. 
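The iteration in Eqs. (5)-(7) can be sketched in a few lines of pure Python. This is a simplified illustration under stated assumptions, not the released implementation: the attention is single-head without learned projections, the FCN and dropout are omitted, layer normalization has no learned gain or bias, and the names attend, layer_norm, add and mia are hypothetical. What it keeps is the update order, the reuse of the same operation in every round, the shortcut from the layer input, and the final fusion.

```python
import math

def attend(q, s):
    # Single-head attention without learned projections (an assumption made
    # for brevity): softmax(q s^T / sqrt(d)) s, one output row per query row.
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, sj)) / math.sqrt(d) for sj in s]
        top = max(scores)
        w = [math.exp(v - top) for v in scores]
        z = sum(w)
        out.append([sum(wj / z * sj[c] for wj, sj in zip(w, s)) for c in range(d)])
    return out

def layer_norm(x, eps=1e-6):
    # Normalise every feature vector to zero mean and unit variance.
    out = []
    for row in x:
        mu = sum(row) / len(row)
        var = sum((v - mu) ** 2 for v in row) / len(row)
        out.append([(v - mu) / math.sqrt(var + eps) for v in row])
    return out

def add(a, b):
    # Element-wise sum of two equally shaped matrices.
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mia(I, T, num_iters=2):
    # The same attend function is reused in every round, mirroring the
    # parameter sharing across iterations.
    for _ in range(num_iters):
        I_next = layer_norm(add(I, attend(T, I)))   # Eq. (6), visual side
        T = layer_norm(add(T, attend(I_next, T)))   # Eq. (6), textual side
        I = I_next
    return layer_norm(add(I, T))                    # Eq. (7)

# Toy inputs: two visual features and two textual concepts of size 4.
I0 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
T0 = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
fused = mia(I0, T0, num_iters=2)
```

The sketch relies on the two feature sets having the same number of rows and the same feature size, which is what allows the shortcut and the final addition to be element-wise.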
As shown by previous work [28], when trained on machine translation, the\nattention can learn correlation of words quite well. As the proposed method focuses on building\nsemantic-grounded image representations, it can be easily incorporated in the downstream models\nto substitute the original image representations, which in turn provides supervision for the mutual\niterative attention. Speci\ufb01cally, we experiment with the task of image captioning and VQA. To use\nthe proposed approach, MIA is added to the downstream models as a preprocessing component.\nFigure 3 illustrates how to equip the baseline systems with MIA, through two examples for image\ncaptioning and VQA, respectively. As we can see, MIA substitutes the original image representations\nwith semantic-grounded image representations. For VQA, the question representations are preserved.\nBesides, MIA does not affect the original experimental settings and training strategies.\n\n3 Experiment\n\nWe evaluate the proposed approach on two multi-modal tasks, i.e., image captioning and visual\nquestion answering (VQA). We \ufb01rst conduct experiments on representative systems that use different\nkinds of image representations to demonstrate the effectiveness of the proposed semantic-grounded\nimage representations, and then provide analysis of the key components of the MIA module.\nBefore introducing the results and the analysis, we \ufb01rst describe some common settings. The\nproposed MIA relies on both visual features and textual concepts to produce semantic-grounded\nimage representations. Considering the diverse forms of the original image representations, unless\notherwise speci\ufb01ed, they are obtained as follows: (1) the grid visual features are from a ResNet-152\npretrained on ImageNet, (2) the region-based visual features are from a variant of Faster R-CNN [23],\nwhich is provided by Anderson et al. 
[2] and pre-trained on Visual Genome [13], and (3) the textual\nconcepts are extracted by a concept extractor in Fang et al. [8] trained on the MSCOCO captioning\ndataset using Multiple Instance Learning [36]. The number of textual concepts is kept the same as\nthe visual features, i.e., 49 for grid visual features and 36 for region visual features, by keeping only\nthe top concepts. The settings of MIA are the same for the two tasks, which re\ufb02ects the generality\nof our method. Particularly, we use 8 heads (k = 8) and iterate twice (N = 2), according to the\nperformance on the validation set. For detailed settings, please refer to the supplementary material.\n\n3.1\n\nImage Captioning\n\nDataset and Evaluation Metrics. We conduct experiments on the MSCOCO image captioning\ndataset [7] and use SPICE [1], CIDEr [29], BLEU [22], METEOR [5] and ROUGE [14] as evaluation\nmetrics, which are calculated by MSCOCO captioning evaluation toolkit [7]. Please note that\nfollowing common practice [17, 2, 16], we adopt the dataset split from Karpathy and Li [11] and the\nresults are not comparable to those from the online MSCOCO evaluation server.\nBaselines. Given an image, the image captioning task aims to generate a descriptive sentence\naccordingly. To evaluate how the proposed semantic-grounded image representation helps the\ndownstream tasks, we \ufb01rst design \ufb01ve representative baseline models that take as input different\nimage representations based on previous work. 
Table 1: Results of the representative systems on the image captioning task.\n\nMethods                    BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  ROUGE   CIDEr   SPICE\nVisual Attention           72.6    56.0    42.2    31.7    26.5    54.6    103.0   19.3\nw/ MIA                     74.5    58.4    44.4    33.6    26.8    55.8    106.7   20.1\nConcept Attention          72.6    55.9    42.5    32.5    26.5    54.4    103.2   19.4\nw/ MIA                     73.8    57.4    43.8    33.6    27.1    55.3    107.9   20.3\nVisual Condition           73.3    56.9    43.4    33.0    26.8    54.8    105.2   19.5\nw/ MIA                     73.9    57.3    43.9    33.7    26.9    55.1    107.2   19.8\nConcept Condition          72.9    56.2    42.8    32.7    26.4    54.4    104.4   19.3\nw/ MIA                     73.9    57.3    43.9    33.7    26.9    55.1    107.2   19.8\nVisual Regional Attention  75.2    58.9    45.2    34.7    27.6    56.0    111.2   20.6\nw/ MIA                     75.6    59.4    45.7    35.4    28.0    56.4    114.1   21.1\n\nTable 2: Evaluation of systems that use reinforcement\nlearning on the MSCOCO image captioning dataset.\n\nMethods      BLEU-4  METEOR  ROUGE   CIDEr   SPICE\nUp-Down      36.5    28.0    57.0    120.9   21.5\nw/ MIA       37.0    28.2    57.4    122.2   21.7\nTransformer  39.0    28.4    58.6    126.3   21.7\nw/ MIA       39.5    29.0    58.7    129.6   22.7\n\nTable 3: The overall accuracy on the\nVQA v2.0 test dataset.\n\nMethods  Test-dev  Test-std\nUp-Down  67.3      67.5\nw/ MIA   68.8      69.1\nBAN      69.6      69.8\nw/ MIA   70.2      70.3\n\nThey are (1) Visual Attention, which uses grid visual\nfeatures as the attention source for each decoding step, (2) Concept Attention, which uses textual\nconcepts as the attention source, (3) Visual Condition, which takes textual 
concepts as extra input at\nthe \ufb01rst decoding step but grid visual features in the following decoding steps, (4) Concept Condition,\nwhich, in contrast to Visual Condition, takes grid visual features at the \ufb01rst decoding step but textual\nconcepts in the following decoding steps, and (5) Visual Regional Attention, which uses region-based\nvisual features as the attention source. For those models, the traditional cross-entropy based training\nobjective is used. We also check on the effect of MIA on more advanced captioning models, including\n(6) Up-Down [2], which uses region-based visual features, and (7) Transformer, which adapts the\nTransformer-Base model in Vaswani et al. [28] by taking the region-based visual features as input.\nThose advanced models adopt CIDEr-based training objective using reinforcement training [24].\nResults. In Table 1, we can see that the models enjoy an increase of 2%\u223c5% in terms of both\nSPICE and CIDEr, with the proposed MIA. Especially, \u201cVisual Attention w/ MIA\u201d and \u201cConcept\nAttention w/ MIA\u201d are able to pay attention to integrated representation collections instead of the\nseparate grid visual features or textual concepts. Besides, the baselines also enjoy the bene\ufb01t from\nthe semantic-grounded image representations, which can be veri\ufb01ed by the improvement of \u201cVisual\nRegional Attention w/ MIA\u201d. The results demonstrate the effectiveness and universality of MIA. As\nshown in Table 2, the proposed method can still bring improvements to the strong baselines under\nthe reinforcement learning settings. Besides, it also suggests that our approach is compatible with\nboth the RNN based (Up-Down) and self-attention based (Transformer) language generators. We\nalso investigate the effect of incorporating MIA with the scene-graph based model [32], the results\nare provided in the supplementary material, where we can also see consistent improvements. 
In all,\nthe baselines are promoted in all metrics across the board, which indicates that the refined image\nrepresentations are less prone to the variations of model structures (e.g., with or without attention,\nand the architecture of downstream language generator), hyper-parameters (e.g., learning rate and\nbatch size), original image representations (e.g., CNN, RCNN-based visual features, textual concepts\nand scene-graphs), and learning paradigm (e.g., cross-entropy and CIDEr based objective).\n\nTable 4: Ablation analysis of the proposed approach. As we can see, incorporating MIA-refined\nimage representation from a single modality can also lead to overall improvements.\n\nMethods            BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  ROUGE   CIDEr   SPICE\nVisual Attention   72.6    56.0    42.2    31.7    26.5    54.6    103.0   19.3\nw/ I_N             74.7    58.5    44.6    33.7    26.5    55.2    105.7   19.6\nw/ MIA             74.5    58.4    44.4    33.6    26.8    55.8    106.7   20.1\nConcept Attention  72.6    55.9    42.5    32.5    26.5    54.4    103.2   19.4\nw/ T_N             73.7    57.0    43.4    33.1    26.8    55.0    106.5   20.0\nw/ MIA             73.8    57.4    43.8    33.6    27.1    55.3    107.9   20.3\n\nFigure 4: Model performance variation under different metrics with the increase of iteration times.\nVA and CA stand for Visual Attention and Concept Attention, respectively.\n\n3.2 Visual Question Answering\n\nDataset and Evaluation Metrics.\nWe experiment on the VQA v2.0 dataset [9], which is comprised of image-based question-answer\npairs labeled by human annotators. The questions are categorized into three types, namely Yes/No,\nNumber and other categories. We report the model performance based on overall accuracy on both\nthe test-dev and test-std sets, which is calculated by the standard VQA metric [3].\nBaselines. Given an image and a question about the image, the visual question answering task aims\nto generate the correct answer, which is modeled as a classification task. 
We choose Up-Down [2]\nand BAN [12] for comparison. They both use region-based visual features as image representations\nand GRU-encoded hidden states as question representations, and make classi\ufb01cation based on their\ncombination. However, Up-Down only uses the \ufb01nal sentence vector to obtain the weight of each\nvisual region, while BAN uses a bilinear attention to obtain the weight for each pair of visual region\nand question word. BAN is the previous state-of-the-art on the VQA v2.0 dataset.\nResults. As shown in Table 3, an overall improvement is achieved when applying MIA to the\nbaselines, which validates that our method generalizes well to different tasks. Especially, on the\nanswer type Number, the MIA promotes the accuracy of Up-Down from 47.5% to 51.2% and BAN\nfrom 50.9% to 53.1%. The signi\ufb01cant improvements suggest that the re\ufb01ned image representations\nare more accurate in counting thanks to integrating semantically related objects.\n\n3.3 Analysis\n\nIn this section, we analyze the effect of the proposed approach and provide insights of the MIA\nmodule, in an attempt to answer the following questions: (1) Is the mutual attention necessary for\nintegrating semantically-related features? (2) Is the improvement spurious because MIA uses two\nkinds input features while some of the baseline models only use one? (3) How does the iteration time\naffect the alignment process? and (4) Does the mutual attention actually align the two modality?\nEffect of mutual attention. Mutual attention serves as a way to integrate correlated features by\naligning modalities, which is our main proposal. Another way to integrate features is to only rely\non information from one domain, which can be achieved by replacing mutual attention with self-\nattention. However, this method is found to be less effective than MIA, scoring 96.6 and 105.4 for\nVisual Attention and Concept Attention, respectively, in terms of CIDEr. 
Especially, the performance\nof the Visual Attention has even been impaired, which suggests that only using information from\none domain is insufficient to construct meaningful region or concept groups that are beneficial to\ndescribing images and confirms our main motivation. Besides, as the self-attention and the mutual\nattention share the same multi-head attention structure, it also indicates that the improvement comes\nfrom the alignment of the two modalities rather than the application of the attention structure.\n\nFigure 5: Visualization of the integrated image representations. Please view in color. We show\nthe representations with different iteration N for two images. We choose three visual features and\ncorresponding textual concepts with clear semantic implication and highlight them with distinct\ncolors. As we see, with N increasing, the alignment becomes more focused and more specific, but the\ncombination of related features is less represented.\n\nAblation Study. As the deployment of MIA inevitably introduces information from the other\nmodality, we conduct ablation studies to investigate whether the improvement is derived from the\nwell-aligned and integrated image representations or the additional source information. As shown in\nTable 4, when using the same single-modal features as the corresponding baselines, our method can\nstill promote the performance. Thanks to the mutual iterative attention process, \u201cVisual Attention\nw/ IN\u201d and \u201cConcept Attention w/ TN\u201d can pay attention to integrated visual features and textual\nconcepts, respectively. 
This frees the decoder from associating the unrelated original features in each\ndomain, which may explain the improvements. The performance in terms of SPICE and CIDEr is\nfurther elevated when TN and IN are combined. The progressively increased scores demonstrate that\nthe improvements indeed come from the refined semantic-grounded image representations produced\nby MIA, rather than the introduction of additional information.\nThe SPICE sub-category results show that IN helps the baselines to generate captions that are more\ndetailed in count and size, TN results in more comprehensiveness in objects, and MIA can help the\nbaselines to achieve a caption that is detailed in all sub-categories. Due to limited space, the scores\nare provided in the supplementary materials. For output samples and intuitive comparisons, please\nrefer to the supplementary materials.\nEffect of iteration times. We select two representative models, i.e., Visual Attention and Concept\nAttention, to analyze the effect of iteration times. Figure 4 presents the performance of Visual\nAttention (VA) and Concept Attention (CA) under different evaluation metrics when equipped with\nthe MIA. We evaluate with iteration times ranging from 1 to 5. As a holistic trend, the scores first rise and then decline\nwith the increase of N. The performance consistently peaks\nat the second iteration, which is why we set N = 2. It suggests that a single iteration\ndoes not suffice to align visual features and textual concepts. With each round of mutual attention,\nthe image representations become increasingly focused, which explains the promotion in the first few\niterations. As for the falling back phenomenon, we speculate that the integration effect of MIA can\nalso unexpectedly eliminate some useful information by assigning them low attention weights. The\nabsence of these key elements results in less comprehensive captions. 
The visualization in Figure 5\nalso attests to our arguments.\n\n[Figure 5 panels: N = 1, N = 3 and N = 5 for each of the two images; heatmap rows R, G, B over the textual concepts of each image (e.g., snow, skis, man, group for the first image; desk, computer, monitors, three for the second).]\n\nVisualization. We visualize the integration of the image representations in Figure 5. The colors in\nthe images and the heatmaps reflect the accumulated attention weights assigned to the original image\nrepresentations up to the current iteration. As we can see in the left plots of Figure 5, the attended\nvisual regions are general in the first iteration, thereby assigning comparable weights to a number\nof visual words with low relevance. Taking the indoor image as an example, the red-colored visual\nregion in the left plot focuses not only on the related words (e.g., computer and monitor) but also on\nwords that describe peripheral objects (e.g., pictures on the wall), and on words that are incorrect (e.g.,\ntelevision). In this case, the inter-domain alignment is weak and the integration of features within\na certain domain is not concentrated, making the image representations undesirable. As the two\nmodalities iteratively attend to each other, the features in the two domains gradually concentrate on\nconcrete objects and corresponding visual words. In the third iteration where the model performance\npeaks (among the visualized iterations), the boundaries of the visual regions are well-defined and\nthe dominant visual words making up the textual concepts are satisfactory. 
However, the features\nare over-concentrated in the fifth iteration, filtering out some requisite information. For example,\nthe red region shrinks to a single person in the first example, and a single monitor in the second\nexample, which reduces the information about number (e.g., group, three, computers and monitors)\nand attribute (e.g., skis). Hence, it is necessary to choose an appropriate number of iterations for\nacquiring better image representations.\n\n4 Related Work\n\nRepresenting images. A number of neural approaches have been proposed to obtain image representations\nin various forms. An intuitive method is to extract visual features using a CNN or an RCNN.\nThe former splits an image into a uniform grid of visual regions (Figure 1 (a)), and the latter produces\nobject-level visual features based on bounding boxes (Figure 1 (b)), which has proven to be more\neffective. For image captioning, Fang et al. [8], Wu et al. [31] and You et al. [35] augmented the\ninformation source with textual concepts that are given by a predictor, which is trained to find the most\nfrequent words in the captions. A more recent advance [34] built graphs over the RCNN-detected\nvisual regions, whose relationships are modeled as directed edges in a scene graph, which is further\nencoded via a Graph Convolutional Network (GCN).\n\nVisual-semantic alignment. To acquire integrated image representations, we introduce the Mutual\nIterative Attention (MIA) strategy, which is based on the self-attention mechanism [28], to align the\nvisual features and textual concepts. It is worth noting that for image captioning, Karpathy and\nLi [11] also introduced the notion of visual-semantic alignment. They endowed the RCNN-based\nvisual features with semantic information by minimizing their distance in a multimodal embedding\nspace with corresponding segments of the ground-truth caption, which is quite different from our\nconcept-based iterative alignment. 
In the field of VQA, some recent efforts [19, 12, 6, 20] have\nalso been dedicated to studying the image-question alignment. Such alignment intends to explore the\nlatent relation between important question words and image regions. In contrast, we focus on a\nmore general purpose of building semantic-grounded image representations through the alignment\nbetween visual regions and corresponding textual concepts. The learned semantic-grounded image\nrepresentations, as shown by our experiments, are complementary to the VQA models that are based\non image-question alignment.\n\n5 Conclusions\n\nWe focus on building integrated image representations to describe salient image regions from both\nvisual and semantic perspectives to address the lack of structural relationships among individual\nfeatures. The proposed Mutual Iterative Attention (MIA) strategy aligns the visual regions and textual\nconcepts by conducting mutual attention over the two modalities in an iterative way. The refined\nimage representations may provide a better starting point for vision-and-language grounding problems.\nIn our empirical studies on the MSCOCO image captioning dataset and the VQA v2.0 dataset, the\nproposed MIA exhibits compelling effectiveness in boosting the baseline systems. The results and\nrelevant analysis demonstrate that the semantic-grounded image representations are essential to\nthe improvements and generalize well to a wide range of existing systems for vision-and-language\ngrounding tasks.\n\nAcknowledgments\n\nThis work was supported in part by the National Natural Science Foundation of China (No. 61673028).\nWe thank all the anonymous reviewers for their constructive comments and suggestions. Xu Sun is\nthe corresponding author of this paper.\n\nReferences\n\n[1] P. Anderson, B. Fernando, M. Johnson, and S. Gould. SPICE: Semantic propositional image caption evaluation. In ECCV, 2016.\n\n[2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. 
Bottom-up and top-down attention for image captioning and VQA. In CVPR, 2018.\n\n[3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. In ICCV, 2015.\n\n[4] L. J. Ba, R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.\n\n[5] S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In IEEvaluation@ACL, 2005.\n\n[6] H. Ben-younes, R. Cad\u00e8ne, M. Cord, and N. Thome. MUTAN: Multimodal tucker fusion for visual question answering. In ICCV, 2017.\n\n[7] X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Doll\u00e1r, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.\n\n[8] H. Fang, S. Gupta, F. N. Iandola, R. K. Srivastava, L. Deng, P. Doll\u00e1r, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig. From captions to visual concepts and back. In CVPR, 2015.\n\n[9] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.\n\n[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.\n\n[11] A. Karpathy and F. Li. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.\n\n[12] J. Kim, J. Jun, and B. Zhang. Bilinear attention networks. In NeurIPS, 2018.\n\n[13] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, M. S. Bernstein, and F. Li. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017.\n\n[14] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In ACL Workshop, 2004.\n\n[15] F. Liu, X. Ren, Y. Liu, H. Wang, and X. Sun. 
simNet: Stepwise image-topic merging network for generating detailed and comprehensive image captions. In EMNLP, 2018.\n\n[16] F. Liu, X. Ren, Y. Liu, K. Lei, and X. Sun. Exploring and distilling cross-modal information for image captioning. In IJCAI, 2019.\n\n[17] J. Lu, C. Xiong, D. Parikh, and R. Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR, 2017.\n\n[18] J. Lu, J. Yang, D. Batra, and D. Parikh. Neural baby talk. In CVPR, 2018.\n\n[19] H. Nam, J. Ha, and J. Kim. Dual attention networks for multimodal reasoning and matching. In CVPR, 2017.\n\n[20] D. Nguyen and T. Okatani. Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In CVPR, 2018.\n\n[21] Y. Pan, T. Yao, H. Li, and T. Mei. Video captioning with transferred semantic attributes. In CVPR, 2017.\n\n[22] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.\n\n[23] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.\n\n[24] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical sequence training for image captioning. In CVPR, 2017.\n\n[25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.\n\n[26] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.\n\n[27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.\n\n[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.\n\n[29] R. 
Vedantam, C. L. Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In CVPR, 2015.\n\n[30] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.\n\n[31] Q. Wu, C. Shen, L. Liu, A. R. Dick, and A. van den Hengel. What value do explicit high level concepts have in vision to language problems? In CVPR, 2016.\n\n[32] X. Yang, K. Tang, H. Zhang, and J. Cai. Auto-encoding scene graphs for image captioning. In CVPR, 2019.\n\n[33] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei. Boosting image captioning with attributes. In ICCV, 2017.\n\n[34] T. Yao, Y. Pan, Y. Li, and T. Mei. Exploring visual relationship for image captioning. In ECCV, 2018.\n\n[35] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In CVPR, 2016.\n\n[36] C. Zhang, J. C. Platt, and P. A. Viola. Multiple instance boosting for object detection. In NIPS, 2006.", "award": [], "sourceid": 3726, "authors": [{"given_name": "Fenglin", "family_name": "Liu", "institution": "Peking University"}, {"given_name": "Yuanxin", "family_name": "Liu", "institution": "Institute of Information Engineering, Chinese Academy of Sciences; SCS, University of Chinese Academy of Sciences"}, {"given_name": "Xuancheng", "family_name": "Ren", "institution": "Peking University"}, {"given_name": "Xiaodong", "family_name": "He", "institution": "JD AI research"}, {"given_name": "Xu", "family_name": "Sun", "institution": "Peking University"}]}