{"title": "Image Captioning: Transforming Objects into Words", "book": "Advances in Neural Information Processing Systems", "page_first": 11137, "page_last": 11147, "abstract": "Image captioning models typically follow an encoder-decoder architecture which uses abstract image feature vectors as input to the encoder.\nOne of the most successful algorithms uses feature vectors extracted from the region proposals obtained from an object detector. In this work we introduce the Object Relation Transformer, that builds upon this approach by explicitly incorporating information about the spatial relationship between input detected objects through geometric attention. Quantitative and qualitative results demonstrate the importance of such geometric attention for image captioning, leading to improvements on all common captioning metrics on the MS-COCO dataset. Code is available at https://github.com/yahoo/object_relation_transformer .", "full_text": "Image Captioning: Transforming Objects into Words\n\nSimao Herdade, Armin Kappeler, Ko\ufb01 Boakye, Joao Soares\n\n{sherdade,kaboakye,jvbsoares}@verizonmedia.com, akappeler@apple.com\n\nYahoo Research\n\nSan Francisco, CA, 94103\n\nAbstract\n\nImage captioning models typically follow an encoder-decoder architecture which\nuses abstract image feature vectors as input to the encoder. One of the most\nsuccessful algorithms uses feature vectors extracted from the region proposals\nobtained from an object detector. In this work we introduce the Object Relation\nTransformer, that builds upon this approach by explicitly incorporating information\nabout the spatial relationship between input detected objects through geometric\nattention. Quantitative and qualitative results demonstrate the importance of such\ngeometric attention for image captioning, leading to improvements on all common\ncaptioning metrics on the MS-COCO dataset. 
Code is available at https://github.com/yahoo/object_relation_transformer.

1 Introduction

Image captioning—the task of providing a natural language description of the content within an image—lies at the intersection of computer vision and natural language processing. As both of these research areas are highly active and have experienced many recent advances, progress in image captioning has naturally followed suit. On the computer vision side, improved convolutional neural network and object detection architectures have contributed to improved image captioning systems. On the natural language processing side, more sophisticated sequential models, such as attention-based recurrent neural networks, have similarly resulted in more accurate caption generation.
Inspired by neural machine translation, most conventional image captioning systems utilize an encoder-decoder framework, in which an input image is encoded into an intermediate representation of the information contained within the image, and subsequently decoded into a descriptive text sequence. This encoding can consist of a single feature vector output of a CNN (as in [25]), or multiple visual features obtained from different regions within the image. In the latter case, the regions can be uniformly sampled (e.g., [26]), or guided by an object detector (e.g., [2]), which has been shown to yield improved performance.
While these detection-based encoders represent the state of the art, at present they do not utilize information about the spatial relationships between the detected objects, such as relative position and size. This information can often be critical to understanding the content within an image, however, and is used by humans when reasoning about the physical world. Relative position, for example, can aid in distinguishing “a girl riding a horse” from “a girl standing beside a horse”. 
Similarly,\nrelative size can help differentiate between \u201ca woman playing the guitar\u201d and \u201ca woman playing the\nukelele\u201d. Incorporating spatial relationships has been shown to improve the performance of object\ndetection itself, as demonstrated in [9]. Furthermore, in machine translation encoders, positional\nrelationships are often encoded, in particular in the case of the Transformer [23], an attention-based\nencoder architecture. The use of relative positions and sizes of detected objects, then, should be of\nbene\ufb01t to image captioning visual encoders as well, as evidenced in Figure 1.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: A visualization of self-attention in our proposed Object Relation Transformer. The\ntransparency of the detected object and its bounding box is proportional to the attention weight with\nrespect to the chair outlined in red. Our model strongly correlates this chair with the companion\nchair to the left, the beach beneath them, and the umbrella above them, relationships displayed in the\ngenerated caption.\n\nFigure 2: Overview of Object Relation Transformer architecture. The Bounding Box Relational\nEncoding diagram describes the changes made to the Transformer architecture\n\nIn this work, we propose and demonstrate the use of object spatial relationship modeling for image\ncaptioning, speci\ufb01cally within the Transformer encoder-decoder architecture. This is achieved by\nincorporating the object relation module of [9] within the Transformer encoder. 
The contributions of\nthis paper are as follows:\n\n\u2022 We introduce the Object Relation Transformer, an encoder-decoder architecture designed\nspeci\ufb01cally for image captioning, that incorporates information about the spatial relation-\nships between input detected objects through geometric attention.\n\n\u2022 We quantitatively demonstrate the usefulness of geometric attention through both baseline\n\ncomparison and an ablation study on the MS-COCO dataset.\n\n\u2022 Lastly, we qualitatively show that geometric attention can result in improved captions that\n\ndemonstrate enhanced spatial awareness.\n\n2\n\n\f2 Related Work\n\nMany early neural models for image captioning [17, 12, 5, 25] encoded visual information using\na single feature vector representing the image as a whole, and hence did not utilize information\nabout objects and their spatial relationships. Karpathy and Fei-Fei in [11], as a notable exception\nto this global representation approach, extracted features from multiple image regions based on an\nR-CNN object detector [7] and generated separate captions for the regions. As a separate caption\nwas generated for each region, however, the spatial relationship between the detected objects was\nnot modeled. This is also true of their follow-on dense captioning work [10], which presented an\nend-to-end approach for obtaining captions relating to different regions within an image. Fang et al.\nin [6] generated image descriptions by \ufb01rst detecting words associated with different regions within\nthe image. The spatial association was made by applying a fully convolutional neural network to the\nimage and generating spatial response maps for the target words. Here again, the authors did not\nexplicitly model any relationships between the spatial regions.\nA family of attention based approaches [26, 30, 28] to image captioning have also been proposed that\nseek to ground the words in the predicted caption to regions in the image. 
As the visual attention is often derived from higher convolutional layers of a CNN, the spatial localization is limited and often not semantically meaningful. Most similar to our work, Anderson et al. in [2] addressed this limitation of typical attention models by combining a “bottom-up” attention model with a “top-down” LSTM. The bottom-up attention acts on mean-pooled convolutional features obtained from the proposed regions of interest of a Faster R-CNN object detector [20]. The top-down LSTM is a two-layer LSTM in which the first layer acts as a visual attention model that attends to the relevant detections for the current token, and the second layer is a language LSTM that generates the next token. The authors demonstrated state-of-the-art performance for both visual question answering and image captioning using this approach, indicating the benefits of combining features derived from object detection with visual attention. Again, spatial information—which we propose in this work via geometric attention—was not utilized. Geometric attention was first introduced by Hu et al. for object detection in [9]. There, the authors used bounding box coordinates and sizes to infer the importance of the relationship of pairs of objects, the assumption being that if two bounding boxes are closer and more similar in size to each other, then their relationship is stronger.
The most successful subsequent work followed the above paradigm of obtaining image features with an object detector, and generating captions through an attention LSTM. As a way of adding global context, Yao et al. in [29] introduced two Graph Convolutional Networks: a semantic relationship graph, and a spatial relationship graph that classifies the relationship between two boxes into 11 classes, such as “inside”, “cover”, or “overlap”. 
In contrast, our approach directly utilizes the size\nratio and difference of the bounding box coordinates, implicitly encoding and generalizing the\naforementioned relationships. Yang et al. in [27] similarly leveraged graph structures, extracting\nobject image features into an image scene graph. In addition, they used a semantic scene graph (i.e.,\na graph of objects, their relationships, and their attributes) autoencoder on caption text to embed a\nlanguage inductive bias in a dictionary that is shared with the image scene graph. While this model\nmay learn typical spatial relationships found in text, it is inherently unable to capture the visual\ngeometry speci\ufb01c to a given image. The use of self-critical reinforcement learning for sentence\ngeneration [21] has also proven to be important for state-of-the-art captioning approaches, such as\nthose above. Liu et al. in [15] proposed an alternative reinforcement learning approach over a visual\npolicy that, in effect, acts as an attention mechanism to combine features from the image regions\nprovided by an object detector. The visual policy, however, does not utilize spatial information about\nthese image regions.\nRecent developments in NLP, namely the Transformer architecture [23] have led to signi\ufb01cant\nperformance improvements for various tasks such as translation [23], text generation [4], and language\nunderstanding [19]. In [22], the Transformer was applied to the task of image captioning. The authors\nexplored extracting a single global image feature from the image as well as uniformly sampling\nfeatures by dividing the image into 8x8 partitions. In the latter case, the feature vectors were fed\nin a sequence to the Transformer encoder. In this paper we propose to improve upon this uniform\nsampling by adopting the bottom-up approach of [2]. 
The Transformer architecture is particularly\nwell suited as a bottom-up visual encoder for captioning since it does not have a notion of order for its\ninputs, unlike an RNN. It can, however, successfully model sequential data with the use of positional\nencoding, which we apply to the decoded tokens in the caption text. Rather than encode an order to\n\n3\n\n\fobjects, our Object Relation Transformer seeks to encode how two objects are spatially related to\neach other and weight them accordingly.\n\n3 Proposed Approach\n\nFigure 2 shows an overview of the proposed image captioning algorithm. We \ufb01rst use an object\ndetector to extract appearance and geometry features from all the detected objects in the image,\nas described in Section 3.1. Thereafter, we use the Object Relation Transformer to generate the\ncaption text. Section 3.2 describes how we use the Transformer architecture [23] in general for image\ncaptioning. Section 3.3 explains our novel addition of box relational encoding to the encoder layer of\nthe Transformer.\n\n3.1 Object Detection\n\nFollowing [2], we use Faster R-CNN [20] with ResNet-101 [8] as the base CNN for object detection\nand feature extraction. Using intermediate feature maps from the ResNet-101 as inputs, a Region\nProposal Network (RPN) generates bounding boxes for object proposals. Using non-maximum\nsuppression, overlapping bounding boxes with an intersection-over-union (IoU) exceeding a threshold\nof 0.7 are discarded. A region-of-interest (RoI) pooling layer is then used to convert all remaining\nbounding boxes to the same spatial size (e.g. 14 \u00d7 14 \u00d7 2048). Additional CNN layers are applied\nto predict class labels and bounding box re\ufb01nements for each box proposal. We further discard all\nbounding boxes where the class prediction probability is below a threshold of 0.2. Finally, we apply\nmean-pooling over the spatial dimension to generate a 2048-dimensional feature vector for each\nobject bounding box. 
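As a rough sketch (not the authors' code), the post-detection filtering and mean-pooling described above might look like the following, assuming the RoI-pooled feature maps and class probabilities are already available; the function name and toy shapes are illustrative:

```python
import numpy as np

def pool_detection_features(roi_feats, class_probs, score_thresh=0.2):
    """Keep boxes whose best class probability exceeds the threshold,
    then mean-pool each surviving 14x14x2048 RoI feature map over its
    spatial dimensions to obtain one 2048-d vector per object.

    roi_feats:   (N, 14, 14, 2048) RoI-pooled feature maps
    class_probs: (N, C) class prediction probabilities
    """
    keep = class_probs.max(axis=1) > score_thresh   # discard low-confidence boxes
    kept = roi_feats[keep]
    return kept.mean(axis=(1, 2))                   # (N_kept, 2048)

# toy usage with made-up probabilities: boxes 2 and 4 fall below the 0.2 threshold
feats = np.random.rand(5, 14, 14, 2048).astype(np.float32)
probs = np.array([[0.9, 0.1], [0.15, 0.1], [0.5, 0.5], [0.05, 0.1], [0.3, 0.7]])
vecs = pool_detection_features(feats, probs)
print(vecs.shape)  # (3, 2048): one 2048-d vector per retained box
```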
These feature vectors are then used as inputs to the Transformer model.

3.2 Standard Transformer Model

The Transformer [23] model consists of an encoder and a decoder, both of which are composed of a stack of layers (in our case 6). For image captioning, our architecture uses the feature vectors from the object detector as inputs and generates a sequence of words (i.e., the image caption) as outputs. Every image feature vector is first processed through an input embedding layer, which consists of a fully-connected layer to reduce the dimension from 2048 to d_model = 512, followed by a ReLU and a dropout layer. The embedded feature vectors are then used as input tokens to the first encoder layer of the Transformer model. We denote x_n as the n-th token of a set of N tokens. For encoder layers 2 to 6, we use the output tokens of the previous encoder layer as the input to the current layer.
Each encoder layer consists of a multi-head self-attention layer followed by a small feed-forward neural network. The self-attention layer itself consists of 8 identical heads. Each attention head first calculates the queries Q, keys K, and values V for the N tokens as follows:

    Q = X W_Q,   K = X W_K,   V = X W_V,    (1)

where X contains all the input vectors x_1 ... x_N stacked into a matrix and W_Q, W_K, and W_V are learned projection matrices.
The attention weights for the appearance features are then computed according to

    Ω_A = Q K^T / √d_k,    (2)

where Ω_A is an N × N attention weight matrix whose elements ω_A^{mn} are the attention weights between the m-th and n-th token. Following the implementation of [23], we choose a constant scaling factor of d_k = 64, which is the dimension of the key, query, and value vectors. The output of the head is then calculated as

    head(X) = self-attention(Q, K, V) = softmax(Ω_A) V    (3)

Equations 1 to 3 are calculated for every head independently. 
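Equations 1 to 3 can be sketched in a few lines of NumPy (an illustrative single head with toy random weights, not the paper's PyTorch implementation):

```python
import numpy as np

def attention_head(X, W_Q, W_K, W_V, d_k=64):
    """One self-attention head over N token vectors (Equations 1-3)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # Eq. 1: project tokens
    omega_A = Q @ K.T / np.sqrt(d_k)              # Eq. 2: N x N appearance weights
    # Eq. 3: row-wise softmax over the weights, then a weighted sum of values
    w = np.exp(omega_A - omega_A.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
N, d_model, d_k = 10, 512, 64                     # e.g. 10 detected objects
X = rng.standard_normal((N, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) * 0.01 for _ in range(3))
out = attention_head(X, W_Q, W_K, W_V)
print(out.shape)  # (10, 64): one d_k-dimensional output per input token
```

In the full model, 8 such heads run in parallel and their outputs are concatenated, as described next.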
The outputs of all 8 heads are then concatenated into one output vector and multiplied with a learned projection matrix W_O, i.e.,

    MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O    (4)

The next component of the encoder layer is the point-wise feed-forward network (FFN), which is applied to each output of the attention layer:

    FFN(x) = max(0, x W_1 + b_1) W_2 + b_2    (5)

where W_1, b_1 and W_2, b_2 are the weights and biases of two fully connected layers. In addition, skip-connections and layer-norm are applied to the outputs of the self-attention and the feed-forward layer.
The decoder then uses the generated tokens from the last encoder layer as input to generate the caption text. Since the dimensions of the output tokens of the Transformer encoder are identical to the tokens used in the original Transformer implementation, we make no modifications on the decoder side. We refer the reader to the original publication [23] for a detailed explanation of the decoder.

3.3 Object Relation Transformer

In our proposed model, we incorporate relative geometry by modifying the attention weight matrix Ω_A in Equation 2. We multiply the appearance-based attention weights ω_A^{mn} of two objects m and n by a learned function of their relative position and size. 
We use the same function that was first introduced in [9] to improve the classification and non-maximum suppression stages of a Faster R-CNN object detector.
First we calculate a displacement vector λ(m, n) for bounding boxes m and n from their geometry features (x_m, y_m, w_m, h_m) and (x_n, y_n, w_n, h_n) (center coordinates, widths, and heights) as

    λ(m, n) = ( log(|x_m − x_n| / w_m), log(|y_m − y_n| / h_m), log(w_n / w_m), log(h_n / h_m) )    (6)

The geometric attention weights are then calculated as

    ω_G^{mn} = ReLU(Emb(λ) W_G)    (7)

where Emb(·) calculates a high-dimensional embedding following the functions PE_pos described in [23], where sinusoid functions are computed for each value of λ(m, n). In addition, we multiply the embedding with the learned vector W_G to project down to a scalar and apply the ReLU non-linearity.
The geometric attention weights ω_G^{mn} are then incorporated into the attention mechanism according to

    ω^{mn} = ω_G^{mn} exp(ω_A^{mn}) / Σ_{l=1}^{N} ω_G^{ml} exp(ω_A^{ml})    (8)

where ω_A^{mn} are the appearance-based attention weights from Equation 2 and ω^{mn} are the new combined attention weights.
The output of the head can be calculated as

    head(X) = self-attention(Q, K, V) = Ω V    (9)

where Ω is the N × N matrix whose elements are given by ω^{mn}.
The Bounding Box Relational Encoding diagram in Figure 2 shows the multi-head self-attention layer of the Object Relation Transformer. Equations 6 to 9 are represented with the Relation boxes.

4 Implementation Details

Our algorithm was developed in PyTorch using the image captioning implementation in [16] as our basis. We ran our experiments on NVIDIA Tesla V100 GPUs. Our best performing model was
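A minimal NumPy sketch of Equations 6 to 8 follows. It is illustrative only: the sinusoidal embedding follows the positional-encoding recipe of [23] in spirit, W_G stands in for the learned projection (here random), the embedding dimension of 64 is a toy choice, and a small epsilon guards the logarithm and the normalization in this sketch:

```python
import numpy as np

def displacement(box_m, box_n, eps=1e-3):
    """Eq. 6: relative-geometry features for boxes given as (x, y, w, h)."""
    xm, ym, wm, hm = box_m
    xn, yn, wn, hn = box_n
    return np.array([np.log(max(abs(xm - xn), eps) / wm),
                     np.log(max(abs(ym - yn), eps) / hm),
                     np.log(wn / wm),
                     np.log(hn / hm)])

def sinusoid_embed(lam, d=64):
    """Sinusoidal embedding of the 4 geometry values, in the style of [23]."""
    freqs = 1.0 / (1000 ** (np.arange(d // 8) / (d // 8)))
    angles = lam[:, None] * freqs[None, :]                    # (4, d/8)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1).ravel()  # (d,)

def combined_attention(boxes, omega_A, W_G):
    """Eqs. 7-8: fold geometric weights into the appearance attention."""
    N = len(boxes)
    omega_G = np.empty((N, N))
    for m in range(N):
        for n in range(N):
            emb = sinusoid_embed(displacement(boxes[m], boxes[n]))
            omega_G[m, n] = max(0.0, emb @ W_G)               # Eq. 7: ReLU(Emb(lambda) W_G)
    # Eq. 8: geometric gate times exp of appearance weights, row-normalized
    # (the tiny epsilon only guards against an all-zero row in this toy sketch)
    w = omega_G * np.exp(omega_A) + 1e-12
    return w / w.sum(axis=1, keepdims=True)

# three made-up boxes in normalized (x, y, w, h) coordinates
boxes = [(0.3, 0.4, 0.2, 0.3), (0.6, 0.4, 0.25, 0.35), (0.5, 0.8, 0.4, 0.2)]
rng = np.random.default_rng(1)
omega = combined_attention(boxes, rng.standard_normal((3, 3)), rng.standard_normal(64))
print(np.allclose(omega.sum(axis=1), 1.0))  # True: each row of weights sums to 1
```

The resulting matrix plays the role of softmax(Ω_A) in Equation 3, so the head output is simply this matrix times V (Equation 9).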
Our best performing model was\npre-trained for 30 epochs with a softmax cross-entropy loss using the ADAM optimizer with learning\nrate de\ufb01ned as in the original Transformer paper, with 20000 warmup steps, and a batch size of 10.\nWe trained for an additional 30 epochs using self-critical reinforcement learning [21] optimizing for\nCIDEr-D score, and did early-stopping for best performance on the validation set (which contains\n5000 images). On a single GPU the training with cross-entropy loss and the self-critical training take\nabout 1 day and 3.5 days, respectively.\n\n5\n\n\fTable 1: Comparative analysis to existing state-of-the-art approaches. The model denoted as Ours\nrefers to the Object Relation Transformer \ufb01ne-tuned using self-critical training and generating\ncaptions using beam search with beam size 5.\n\nCIDEr-D SPICE BLEU-1 BLEU-4 METEOR ROUGE-L\n\nAlgorithm\nAtt2all [21]\nUp-Down [2]\n\nVisual-policy[15]\nGCN-LSTM [29]1\n\nSGAE [27]\n\nOurs\n\n114\n120.1\n126.3\n127.6\n127.8\n128.3\n\n-\n\n21.4\n21.6\n22.0\n22.1\n22.6\n\n-\n\n\u2013\n\n79.8\n\n80.5\n80.8\n80.5\n\n34.2\n36.3\n38.6\n38.2\n38.4\n38.6\n\n26.7\n27.7\n28.3\n28.5\n28.4\n28.7\n\n55.7\n56.9\n58.5\n58.3\n58.6\n58.4\n\nThe models compared in sections 5.3-5.6 are evaluated after training for 30 epochs with standard\ncross-entropy loss, using ADAM optimization with the above learning rate schedule, and with batch\nsize 15. The evaluation in those sections for the best performing models was obtained setting beam\nsize to 2, in consistency with other research on image captioning optimization [21] (appendix A).\nOnly in Table 1, for a fair comparison with other models in the literature, we present our result with\nthe same beam size of 5 that other works have used to communicate their performance.\n\n5 Experimental Evaluation\n\n5.1 Dataset and Metrics\n\nWe trained and evaluated our algorithm on the Microsoft COCO (MS-COCO) 2014 Captions\ndataset [14]. 
We report results on the Karpathy validation and test splits [11], which are commonly used in other image captioning publications. The dataset contains 113K training images with 5 human-annotated captions for each image. The Karpathy test and validation sets contain 5K images each. We evaluate our models using the CIDEr-D [24], SPICE [1], BLEU [18], METEOR [3], and ROUGE-L [13] metrics. While it has been shown experimentally that BLEU and ROUGE have lower correlation with human judgments than the other metrics [1, 24], the common practice in the image captioning literature is to report all the aforementioned metrics.

5.2 Comparative Analysis

We compare our proposed algorithm against the best results from a single model¹ of the self-critical sequence training approach (Att2all) [21], the Bottom-Up and Top-Down (Up-Down) [2] baseline, and the three best image captioning models to date [15, 29, 27]. Table 1 shows the metrics for the test split as reported by the authors. Following the implementation of [2], we fine-tune our model using self-critical training optimized for CIDEr-D score [21] and apply beam search with beam size 5, achieving a 6.8% relative improvement over the Up-Down baseline, as well as state-of-the-art results on the captioning-specific metrics CIDEr-D and SPICE, and on METEOR and BLEU-4.

5.3 Positional Encoding

Our proposed geometric attention can be seen as a replacement for the positional encoding of the original Transformer network. While objects do not have an inherent notion of order, there do exist some simpler analogues to positional encoding, such as ordering by object size, or left-to-right or top-to-bottom based on bounding box coordinates. We provide a comparison between our geometric attention and these object orderings in Table 2. For box size, we simply calculate the area of each bounding box and order from largest to smallest. 
For left-to-right we order bounding boxes according to the x-coordinate of their centroids. Analogous ordering is performed for top-to-bottom using the centroid y-coordinate. Based on the CIDEr-D scores shown, adding such an artificial ordering to the detected objects decreases the performance. We observed similar decreases in performance across all other metrics (SPICE, BLEU, METEOR and ROUGE-L).

¹Some publications include results obtained from an ensemble of models. Specifically, the ensemble of two distinct graph convolution networks in GCN-LSTM [29] achieves a superior CIDEr-D score to our stand-alone model.

Table 2: Positional Encoding Comparison (models trained with softmax cross-entropy for 30 epochs)

Positional Encoding                            CIDEr-D
no encoding                                    111.0
positional encoding (ordered by box size)      108.7
positional encoding (ordered left-to-right)    110.2
positional encoding (ordered top-to-bottom)    109.1
geometric attention                            112.6

Table 3: Ablation study. All metrics are reported for the validation and the test split, after training with softmax cross-entropy for 30 epochs. The Transformer (Transf) and the Object Relation Transformer (ObjRel Transf) are described in detail in Section 3.

Algorithm                 Split  CIDEr-D  SPICE  BLEU-1  BLEU-4  METEOR  ROUGE-L
Up-Down + LSTM            val    105.6    19.7   75.5    32.9    26.5    55.6
                          test   106.6    19.9   75.6    32.9    26.5    55.4
Up-Down + Transf          val    110.5    20.8   75.2    33.3    27.6    55.8
                          test   111.0    20.9   75.0    32.8    27.5    55.6
Up-Down + ObjRel Transf   val    113.2    21.0   76.1    34.4    27.7    56.4
                          test   112.6    20.8   75.6    33.5    27.6    56.0
Up-Down + ObjRel Transf   val    114.7    21.1   76.5    35.5    27.9    56.6
  + Beamsize 2            test   115.4    21.2   76.6    35.5    28.0    56.6

5.4 Ablation Study

Table 3 shows the results for our ablation study. We show the Bottom-Up and Top-Down algorithm [2] as our baseline algorithm. The second row replaces the LSTM with a Transformer network. 
The third\nrow includes the proposed geometric attention. The last row includes beam search with beam size\n2. The contribution of the Object Relation Transformer is small for METEOR, but signi\ufb01cant for\nCIDEr-D and the BLEU metrics. Overall we can see the most improvements on the CIDEr-D and\nBLEU-4 scores.\n\n5.5 Geometric Improvement\n\nIn order to demonstrate the advantages of the geometric attention layer, we performed a more detailed\ncomparison of the Object Relation Transformer against the Standard Transformer. For each of the\nconsidered metrics, we performed a two-tailed t-test with paired samples in order to determine\nwhether the difference caused by adding the geometric attention was statistically signi\ufb01cant. The\nmetrics were \ufb01rst computed for each individual image in the test set for each of the two Transformer\nmodels, so that we are able to run the paired tests. In addition to the standard evaluation metrics, we\nalso report metrics obtained from SPICE by splitting up the tuples of the scene graphs according\nto different semantic subcategories. For each subcategory, we are able to compute precision, recall,\nand F-scores. The measures we report are the F-scores computed by taking only the tuples in each\nsubcategory. More speci\ufb01cally, we report SPICE scores for: Object, Relation, Attribute, Color, Count,\nand Size [1]. Note that for a given image, not all SPICE subcategory scores might be available. For\nexample, if the reference captions for a given image have no mention of color, then the SPICE Color\nscore is not de\ufb01ned and therefore we omit that image from that particular analysis. In spite of this,\neach subcategory analyzed had at least 1000 samples. For this experiment, we did not use self-critical\ntraining for either Transformer and they were both run with a beam size of 2.\nThe metrics computed over the 5000 images of the test set are shown in Tables 4 and 5. 
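The paired comparison described above can be sketched as follows (a toy illustration of the paired t-test statistic, not the authors' evaluation code; the per-image scores below are made up):

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """t statistic for a paired t-test: per-image metric differences
    between two models on the same images, tested against a mean of zero."""
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n)                       # df = n - 1

# made-up per-image CIDEr-D scores for two models on the same 6 images
standard = [0.9, 1.1, 0.8, 1.0, 1.2, 0.7]
ours     = [1.0, 1.1, 0.9, 1.1, 1.3, 0.8]
t = paired_t_statistic(standard, ours)
print(round(t, 2))  # 5.0
```

The two-tailed p-value then comes from the t distribution with n − 1 degrees of freedom; in practice a library routine such as scipy.stats.ttest_rel computes both the statistic and the p-value.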
We first note that for all of the metrics, the Object Relation Transformer presents higher scores than the Standard Transformer. The score difference was statistically significant (using a significance level α = 0.05) for CIDEr-D, BLEU-1, ROUGE-L (Table 4), Relation, and Count (Table 5). The significant improvements in CIDEr-D and Relation are in line with our expectation that adding the geometric attention layer would help the model in determining the correct relationships between objects. In addition, it is interesting to see a significant improvement in the Count subcategory of SPICE, from 11.30 to 17.51. Though image captioning methods in general show a large deficit in Count scores when compared with humans [1], we are able to show a significant improvement by adding explicit positional information. Some examples illustrating these improvements are presented in Section 5.6.

Table 4: Comparison of different captioning metrics for the Standard Transformer and our proposed Object Relation Transformer (denoted Ours below), trained with softmax cross-entropy for 30 epochs. The table shows that the Object Relation Transformer has significantly higher CIDEr-D, BLEU-1 and ROUGE-L scores. The p-values come from two-tailed t-tests using paired samples. Values marked in bold were considered significant at α = 0.05.

Algorithm             CIDEr-D  SPICE  BLEU-1  BLEU-4  METEOR  ROUGE-L
Standard Transformer  113.21   21.04  75.60   34.58   27.79   56.02
Ours                  115.37   21.24  76.63   35.49   27.98   56.58
p-value               0.01     0.15   <0.001  0.051   0.24    0.01

Table 5: Breakdown of SPICE metrics for the Standard Transformer and our proposed Object Relation Transformer (denoted Ours below), trained with softmax cross-entropy for 30 epochs. The table shows that the Object Relation Transformer has significantly higher Relation and Count scores. The p-values come from two-tailed t-tests using paired samples. 
Values marked in bold were considered significant at α = 0.05.

                      SPICE
Algorithm             All    Object  Relation  Attribute  Color  Count   Size
Standard Transformer  21.04  37.83   5.88      11.31      14.88  11.30   5.82
Ours                  21.24  37.92   6.31      11.37      15.49  17.51   6.38
p-value               0.15   0.64    0.01      0.81       0.35   <0.001  0.34

5.6 Qualitative Analysis

To illustrate the advantages of the Object Relation Transformer relative to the Standard Transformer, we present example images with the corresponding captions generated by each model. The captions presented were generated using the following setup: both the Object Relation Transformer and the Standard Transformer were trained without self-critical training, and both were run with a beam size of 2 on the 5000 images of the test set. We chose examples for which there was a marked improvement in the score of the Object Relation Transformer relative to the Standard Transformer. This was done for the Relation and Count subcategories of SPICE scores. The example images and captions are presented in Tables 6 and 7. The images in Table 6 illustrate an improvement in determining when a relationship between objects should be expressed, as well as in determining what that relationship should be. An example of correctly determining that a relationship should exist is shown in the third image of Table 6, where the two chairs are actually related to the umbrella by being underneath it. In addition, an example where the Object Relation Transformer correctly infers the type of relationship between objects is shown in the first image of Table 6, where the man in fact is not on the motorcycle, but is working on it. 
The examples in Table 7 speci\ufb01cally illustrate the\nObject Relation Transformer\u2019s marked ability to better count objects.\n\nTable 6: Example images and captions for which the SPICE Relation metric for Object Relation\nTransformer shows an improvement over the metric for the Standard Transformer.\n\nStandard: a man on a\nmotorcycle on the road\nOurs: a man is work-\ning on a motorcycle in\na parking lot\n\na couple of bears stand-\ning on top of a rock\n\ntwo chairs and an um-\nbrella on a beach\n\na laptop computer sitting\non top of a wooden desk\n\ntwo brown bears stand-\ning next to each other on\na rock\n\ntwo beach chairs under\nan umbrella on the beach\n\na desk with a laptop and\na keyboard\n\n8\n\n\fTable 7: Example images and captions for which the SPICE Count metric for the Object Relation\nTransformer shows an improvement over the metric for the Standard Transformer.\n\nStandard: a large bird\nis standing in a cage\n\na little girl sitting on top\nof a giraffe\n\nOurs:\ntwo large birds\nstanding in a fenced in\narea\n\na giraffe with two kids\nsitting on it\n\na group of young men\nriding skateboards down\na sidewalk\ntwo young men riding\nskateboards down a side-\nwalk\n\nthree children are sitting\non a bunk bed\n\ntwo young children are\nsitting on the bunk beds\n\nIn order to better understand the failure modes of our model, we manually reviewed a set of generated\ncaptions. We used our best performing model\u2014the Object Relation Transformer trained with self-\ncritical reinforcement learning\u2014with a beam size of 5 to generate captions for 100 randomly sampled\nimages from the MS-COCO\u2019s test set. For each generated caption, we described the errors and\nthen grouped them into distinct failure modes. An error was counted each time a term was wrong,\nextraneous, or missing. All errors were then tallied up, with each image being able to contribute with\nmultiple errors. 
There were a total of 62 observed errors, which were grouped into 4 categories: 58% of the errors pertained to objects or things, 21% to relations, 16% to attributes, and 5% to syntax. Note that while these failure modes are very similar to the semantic subcategories from SPICE, we were not explicitly aiming to adhere to those. In addition, one general pattern that stood out was the errors in identifying rare or unusual objects. Some examples of unusual objects that were not correctly identified include: parking meter, clothing mannequin, umbrella hat, tractor, and masking tape. This issue is also noticeable, even if to a lesser degree, in rare relations and attributes. Another interesting observation was that the generated captions tend to be less descriptive and less discursive than the ground truth captions. The above results and observations can be used to help prioritize future efforts in image captioning.

6 Conclusion

We have presented the Object Relation Transformer, a modification of the conventional Transformer, specifically adapted to the task of image captioning. The proposed Transformer encodes 2D position and size relationships between detected objects in images, building upon the bottom-up and top-down image captioning approach. Our results on the MS-COCO dataset demonstrate that the Transformer does indeed benefit from incorporating spatial relationship information, most evidently when comparing the relevant sub-metrics of the SPICE captioning metric. We have also presented qualitative examples of how incorporating this information can yield captioning results demonstrating better spatial awareness.
At present, our model only takes into account geometric information in the encoder phase. As a next step, we intend to incorporate geometric attention in our decoder cross-attention layers between objects and words. We aim to do this by explicitly associating decoded words with object bounding boxes. 
This should lead to additional performance gains as well as improved interpretability of the model.

Acknowledgments

The authors would like to thank Ruotian Luo for making his image captioning code available on GitHub [16].

References

[1] P. Anderson, B. Fernando, M. Johnson, and S. Gould. SPICE: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pages 382-398. Springer, 2016.

[2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[3] M. Denkowski and A. Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376-380, 2014.

[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[5] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625-2634, 2015.

[6] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1473-1482, 2015.

[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580-587, 2014.

[8] K. He, X. Zhang, S. Ren, and J. Sun.
Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

[9] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3588-3597, 2018.

[10] J. Johnson, A. Karpathy, and L. Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4565-4574, 2016.

[11] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128-3137, 2015.

[12] R. Kiros, R. Salakhutdinov, and R. Zemel. Multimodal neural language models. In International Conference on Machine Learning, pages 595-603, 2014.

[13] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, 2004.

[14] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740-755. Springer, 2014.

[15] D. Liu, Z.-J. Zha, H. Zhang, Y. Zhang, and F. Wu. Context-aware visual policy network for sequence-level image captioning. In Proceedings of the 26th ACM International Conference on Multimedia, MM '18, pages 1416-1424. ACM, 2018.

[16] R. Luo. An image captioning codebase in PyTorch. https://github.com/ruotianluo/ImageCaptioning.pytorch, 2017.

[17] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv preprint arXiv:1412.6632, 2014.

[18] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu.
BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318. Association for Computational Linguistics, 2002.

[19] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1:8, 2019.

[20] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.

[21] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7008-7024, 2017.

[22] P. Sharma, N. Ding, S. Goodman, and R. Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556-2565, Melbourne, Australia, July 2018. Association for Computational Linguistics.

[23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008, 2017.

[24] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566-4575, 2015.

[25] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156-3164, 2015.

[26] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y.
Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048-2057, 2015.

[27] X. Yang, K. Tang, H. Zhang, and J. Cai. Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10685-10694, 2019.

[28] Z. Yang, Y. Yuan, Y. Wu, W. W. Cohen, and R. R. Salakhutdinov. Review networks for caption generation. In Advances in Neural Information Processing Systems, pages 2361-2369, 2016.

[29] T. Yao, Y. Pan, Y. Li, and T. Mei. Exploring visual relationship for image captioning. In European Conference on Computer Vision, pages 684-699, 2018.

[30] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4651-4659, 2016.