{"title": "Partially-Supervised Image Captioning", "book": "Advances in Neural Information Processing Systems", "page_first": 1875, "page_last": 1886, "abstract": "Image captioning models are becoming increasingly successful at describing the content of images in restricted domains. However, if these models are to function in the wild --- for example, as assistants for people with impaired vision --- a much larger number and variety of visual concepts must be understood. To address this problem, we teach image captioning models new visual concepts from labeled images and object detection datasets. Since image labels and object classes can be interpreted as partial captions, we formulate this problem as learning from partially-specified sequence data. We then propose a novel algorithm for training sequence models, such as recurrent neural networks, on partially-specified sequences which we represent using finite state automata. In the context of image captioning, our method lifts the restriction that previously required image captioning models to be trained on paired image-sentence corpora only, or otherwise required specialized model architectures to take advantage of alternative data modalities. Applying our approach to an existing neural captioning model, we achieve state of the art results on the novel object captioning task using the COCO dataset. 
We further show that we can train a captioning model to describe new visual concepts from the Open Images dataset while maintaining competitive COCO evaluation scores.", "full_text": "Partially-Supervised Image Captioning

Peter Anderson
Macquarie University∗
Sydney, Australia
p.anderson@mq.edu.au

Stephen Gould
Australian National University
Canberra, Australia
stephen.gould@anu.edu.au

Mark Johnson
Macquarie University
Sydney, Australia
mark.johnson@mq.edu.au

Abstract

Image captioning models are becoming increasingly successful at describing the content of images in restricted domains. However, if these models are to function in the wild — for example, as assistants for people with impaired vision — a much larger number and variety of visual concepts must be understood. To address this problem, we teach image captioning models new visual concepts from labeled images and object detection datasets. Since image labels and object classes can be interpreted as partial captions, we formulate this problem as learning from partially-specified sequence data. We then propose a novel algorithm for training sequence models, such as recurrent neural networks, on partially-specified sequences which we represent using finite state automata. In the context of image captioning, our method lifts the restriction that previously required image captioning models to be trained on paired image-sentence corpora only, or otherwise required specialized model architectures to take advantage of alternative data modalities. Applying our approach to an existing neural captioning model, we achieve state of the art results on the novel object captioning task using the COCO dataset. 
We further show that\nwe can train a captioning model to describe new visual concepts from the Open\nImages dataset while maintaining competitive COCO evaluation scores.\n\n1\n\nIntroduction\n\nThe task of automatically generating image descriptions, i.e., image captioning [1\u20133], is a long-\nstanding and challenging problem in arti\ufb01cial intelligence that demands both visual and linguistic\nunderstanding. To be successful, captioning models must be able to identify and describe in natural\nlanguage the most salient elements of an image, such as the objects present and their attributes, as\nwell as the spatial and semantic relationships between objects [3]. The recent resurgence of interest\nin this task has been driven in part by the development of new and larger benchmark datasets such as\nFlickr 8K [4], Flickr 30K [5] and COCO Captions [6]. However, even the largest of these datasets,\nCOCO Captions, is still based on a relatively small set of 91 underlying object classes. As a result,\ndespite continual improvements to image captioning models and ever-improving COCO caption\nevaluation scores [7\u201310], captioning models trained on these datasets fail to generalize to images\nin the wild [11]. This limitation severely hinders the use of these models in real applications, for\nexample as assistants for people with impaired vision [12].\nIn this work, we use weakly-annotated data (readily available in object detection datasets and labeled\nimage datasets) to improve image captioning models by increasing the number and variety of visual\nconcepts that can be successfully described. Compared to image captioning datasets such as COCO\nCaptions, several existing object detection datasets [14] and labeled image datasets [15, 16] are much\nlarger and contain many more visual concepts. 
For example, the recently released Open Images dataset V4 [14] contains 1.9M images human-annotated with object bounding boxes for 600 object classes, compared to the 165K images and 91 underlying object classes in COCO Captions. This reflects the observation that, in general, object detection datasets may be easier to scale — possibly semi-automatically [17, 18] — to new concepts than image caption datasets. Therefore, in order to build more useful captioning models, finding ways to assimilate information from these other data modalities is of paramount importance.

∗Now at Georgia Tech (peter.anderson@gatech.edu)

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Conceptual overview of partially-specified sequence supervision (PS3) applied to image captioning. In Step 1 we construct finite state automata (FSA) to represent image captions partially-specified by object annotations, and use constrained beam search (CBS) decoding [13] to find high probability captions that are accepted by the FSA. In Step 2, we update the model parameters using the completed sequences as training targets.

To train image captioning models on object detections and labeled images, we formulate the problem as learning from partially-specified sequence data. For example, we might interpret an image labeled with 'scooter' as a partial caption containing the word 'scooter' and an unknown number of other missing words, which when combined with 'scooter' in the correct order constitute the complete sequence. If an image is annotated with the object class 'person', this may be interpreted to suggest that the complete caption description must mention 'person'. 
However, we may also wish to consider\ncomplete captions that reference the person using alternative words that are appropriate to speci\ufb01c\nimage contexts \u2014 such as \u2018man\u2019, \u2018woman\u2019, \u2018cowboy\u2019 or \u2018biker\u2019. Therefore, we characterize our\nuncertainty about the complete sequence by representing each partially-speci\ufb01ed sequence as a \ufb01nite\nstate automaton (FSA) that encodes which sequences are consistent with the observed data. FSA\nare widely used in natural language processing because of their \ufb02exibility and expressiveness, and\nbecause there are well-known techniques for constructing and manipulating such automata (e.g.,\nregular expressions can be compiled into FSA).\nGiven training data where the captions are either complete sequences or FSA representing partially-\nspeci\ufb01ed sequences, we propose a novel two-step algorithm inspired by expectation maximization\n(EM) [19, 20] to learn the parameters of a sequence model such as a recurrent neural network (RNN)\nwhich we will use to generate complete sequences at test time. As illustrated in Figure 1, in the \ufb01rst\nstep we use constrained beam search decoding [13] to \ufb01nd high probability complete sequences that\nsatisfy the FSA. In the second step, we learn or update the model parameters using the completed\ndataset. We dub this approach PS3, for partially-speci\ufb01ed sequence supervision. In the context of\nimage captioning, PS3 allows us to train captioning models jointly over both image caption and object\ndetection datasets. 
Our method thus lifts the restriction that previously required image captioning models to be trained on paired image-sentence corpora only, or otherwise required specialized model architectures to be used in order to take advantage of other data modalities [21–24].

Consistent with previous work [13, 21–24], we evaluate our approach on the COCO novel object captioning splits in which all mentions of eight selected object classes have been eliminated from the caption training data. Applying PS3 to an existing open source neural captioning model [10], and training on auxiliary data consisting of either image labels or object annotations, we achieve state of the art results on this task. 

[Figure 2 diagrams, shown as finite state automata: (a) A sequence (a, <unk>, c) from vocabulary Σ where the length of the missing subsequence <unk> is unknown. (b) A sequence that doesn't mention 'the score'. (c) A sequence that mentions word(s) from at least two of the three disjunctive sets D1, D2 and D3.]

Figure 2: PS3 is a general approach to training RNNs on partially-specified sequences. Here we illustrate some examples of partially-specified sequences that can be represented with finite state automata. Unlabeled edges indicate 'default transitions', i.e., an unlabeled edge leaving a node n is implicitly labeled with Σ \ S, where S is the set of symbols on labeled edges leaving n and Σ is the complete vocabulary.
Furthermore, we conduct experiments training on the Open Images\ndataset, demonstrating that using our method a captioning model can be trained to identify new visual\nconcepts from the Open Images dataset while maintaining competitive COCO evaluation scores.\nOur main contributions are threefold. First, we propose PS3, a novel algorithm for training sequence\nmodels such as RNNs on partially-speci\ufb01ed sequences represented by FSA (which includes sequences\nwith missing words as a special case). Second, we apply our approach to the problem of training\nimage captioning models from object detection and labeled image datasets, enabling arbitrary image\ncaptioning models to be trained on these datasets for the \ufb01rst time. Third, we achieve state of the art\nresults for novel object captioning, and further demonstrate the application of our approach to the\nOpen Images dataset. To encourage future work, we have released our code and trained models via\nthe project website2. As illustrated by the examples in Figure 2, PS3 is a general approach to training\nsequence models that may be applicable to various other problem domains with partially-speci\ufb01ed\ntraining sequences.\n\n2 Related work\n\nImage captioning The problem of image captioning has been intensively studied. More recent\napproaches typically combine a pretrained Convolutional Neural Network (CNN) image encoder with\na Recurrent Neural Network (RNN) decoder that is trained to predict the next output word, conditioned\non the previous output words and the image [1, 25\u201328], optionally using visual attention [2, 7\u201310].\nLike other sequence-based neural networks [29\u201332], these models are typically decoded by searching\nover output sequences either greedily or using beam search. 
As outlined in Section 3, our proposed\npartially-supervised training algorithm is applicable to this entire class of sequence models.\n\nNovel object captioning A number of previous works have studied the problem of captioning\nimages containing novel objects (i.e., objects not present in training captions) by learning from image\nlabels. Many of the proposed approaches have been architectural in nature. The Deep Compositional\nCaptioner (DCC) [21] and the Novel Object Captioner (NOC) [22] both decompose the captioning\nmodel into separate visual and textual pipelines. The visual pipeline consists of a CNN image\nclassi\ufb01er that is trained to predict words that are relevant to an image, including the novel objects.\nThe textual pipeline is a RNN trained on language data to estimate probabilities over word sequences.\nEach pipeline is pre-trained separately, then \ufb01ne-tuned jointly using the available image and caption\ndata. More recently, approaches based on constrained beam search [13], word copying [33] and\n\n2www.panderson.me/constrained-beam-search\n\n3\n\n\fneural slot-\ufb01lling [24] have been proposed to incorporate novel word predictions from an image\nclassi\ufb01er into the output of a captioning model. In contrast to the specialized architectures previously\nproposed for handling novel objects [21\u201324], we present a general approach to training sequence\nmodels on partially-speci\ufb01ed data that uses constrained beam search [13] as a subroutine.\n\nSequence learning with partial supervision Many previous works on semi-supervised sequence\nlearning focus on using unlabeled sequence data to improve learning, for example by pre-training\nRNNs [34, 35] or word embeddings [36, 37] on large text corpora. Instead, we focus on the scenario\nin which the sequences are incomplete or only partially-speci\ufb01ed, which occurs in many practical\napplications ranging from speech recognition [38] to healthcare [39]. 
To the best of our knowledge we are the first to consider using finite state automata as a new way of representing labels that strictly generalizes both complete and partially-specified sequences.

3 Partially-specified sequence supervision (PS3)

In this section, we describe how partially-specified data can be incorporated into the training of a sequence prediction model. We assume a model parameterized by θ that represents the distribution over complete output sequences y = (y1, . . . , yT), y ∈ Y as a product of conditional distributions:

p_θ(y) = ∏_{t=1}^{T} p_θ(y_t | y_{1:t−1})    (1)

where each yt is a word or other token from vocabulary Σ. This model family includes recurrent neural networks (RNNs) and auto-regressive convolutional neural networks (CNNs) [29] with application to tasks such as language modeling [30], machine translation [31, 32], and image captioning [1–3]. We further assume that we have a dataset of partially-specified training sequences X = {x0, . . . , xm}, and we propose an algorithm that simultaneously estimates the parameters of the model θ and the complete sequence data Y.

3.1 Finite state automaton specification for partial sequences

Traditionally partially-specified data X is characterized as incomplete data containing missing values [19, 40], i.e., some sequence elements are replaced by an unknown word symbol <unk>. However, this formulation is insufficiently flexible for our application, so we propose a more general representation that encompasses missing values as a special case. We represent each partially-specified sequence xi ∈ X with a finite state automaton (FSA) Ai that recognizes sequences that are consistent with the observed partial information. 
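As a concrete sketch of such a recognizer, the automaton of Figure 2(a) — a sequence that starts with 'a', continues with a missing subsequence of unknown length, and ends with 'c' — can be written directly as a transition function over a handful of states. The state numbering and vocabulary here are illustrative only, not taken from the paper.

```python
# Minimal FSA sketch for Figure 2(a): accept (a, <unk>, c) where the missing
# subsequence <unk> has unknown length (here we also allow an empty gap).
# States: 0 = start, 1 = seen leading 'a', 2 = last word was 'c' (accepting).
ACCEPTING = {2}

def delta(state, word):
    if state == 0:
        return 1 if word == "a" else None   # sequence must start with 'a'
    return 2 if word == "c" else 1          # track whether 'c' was seen last

def accepts(seq):
    state = 0
    for word in seq:
        state = delta(state, word)
        if state is None:                   # no valid transition: reject
            return False
    return state in ACCEPTING

assert accepts(["a", "c"])                  # empty gap
assert accepts(["a", "x", "y", "c"])        # gap of unknown length
assert not accepts(["b", "c"])              # must start with 'a'
assert not accepts(["a", "x"])              # must end with 'c'
```

Any regular-expression-style constraint over the vocabulary can be compiled into such a transition function, which is what makes the FSA representation strictly more general than a fixed pattern of `<unk>` slots.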
Formally, A^i = (Σ, S^i, s^i_0, δ^i, F^i), where Σ is the model vocabulary, S^i is the set of automaton states, s^i_0 ∈ S^i is the initial state, δ^i : S^i × Σ → S^i is the state-transition function that maps states and words to states, and F^i ⊆ S^i is the set of final or accepting states [41].
As illustrated in Figure 2, this approach can encode very expressive uncertainties about the partially-specified sequence. For example, we can allow for missing subsequences of unknown or bounded length, negative information, and observed constraints in the form of conjunctions of disjunctions or partial orderings. Given this flexibility, from a modeling perspective the key challenge in implementing the proposed approach will be determining the appropriate FSA to encode the observed partial information. We discuss this further from the perspective of image captioning in Section 4.

3.2 Training algorithm

We now present a high level specification of the proposed PS3 training algorithm. Given a dataset of partially-specified training sequences X and current model parameters θ, iteratively perform the following two steps:

Step 1. Estimate the complete data Y by setting y^i ← argmax_y p_θ(y | A^i) for all x^i ∈ X
Step 2. Learn the model parameters by setting θ ← argmax_θ ∑_{y∈Y} log p_θ(y)

Step 1 can be skipped for complete sequences, but for partially-specified sequences Step 1 requires us to find the most likely output sequence that satisfies the constraints specified by an FSA. As it is typically computationally infeasible to solve this problem exactly, we use constrained beam search [13] to find an approximate solution.

Algorithm 1 Beam search decoding
1: procedure BS(Θ, b, T, Σ)                                ▷ With beam size b and vocabulary Σ
2:   B ← {ε}                                               ▷ ε is the empty string
3:   for t = 1, . . . , T do
4:     E ← {(y, w) | y ∈ B, w ∈ Σ}                         ▷ All one-word extensions of sequences in B
5:     B ← argmax_{E'⊆E, |E'|=b} ∑_{y∈E'} Θ(y)             ▷ The b most probable extensions in E
6:   return argmax_{y∈B} Θ(y)                              ▷ The most probable sequence

Algorithm 2 Constrained beam search decoding [13]
1: procedure CBS(Θ, b, T, A = (Σ, S, s0, δ, F))            ▷ With finite state recognizer A
2:   for s ∈ S do
3:     B_s ← {ε} if s = s0 else ∅                          ▷ Each state s has a beam B_s
4:   for t = 1, . . . , T do
5:     for s ∈ S do
6:       E_s ← ∪_{s'∈S} {(y, w) | y ∈ B_{s'}, w ∈ Σ, δ(s', w) = s}   ▷ Extend sequences through state-transition function δ
7:     for s ∈ S do
8:       B_s ← argmax_{E'⊆E_s, |E'|=b} ∑_{y∈E'} Θ(y)       ▷ The b most probable extensions in E_s
9:   return argmax_{y ∈ ∪_{s∈F} B_s} Θ(y)                  ▷ The most probable accepted sequence

In Algorithms 1 and 2 we provide an overview of the constrained beam search algorithm, contrasting it with beam search [42]. Both algorithms take as inputs a scoring function which we define by Θ(y) = log p_θ(y), a beam size b, the maximum sequence length T and the model vocabulary Σ. However, the constrained beam search algorithm additionally takes a finite state recognizer A as input, and guarantees that the sequence returned will be accepted by the recognizer. Refer to Anderson et al. [13] for a more complete description of constrained beam search. 
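As a minimal sketch of Algorithm 2, the toy implementation below keeps one beam per FSA state and routes every one-word extension through the transition function δ. The vocabulary, the scoring function (a stand-in for log p_θ), and the two-state recognizer ("the sequence must mention 'c'") are all invented for illustration and are far simpler than the captioning setting.

```python
# Toy constrained beam search in the style of Algorithm 2.
VOCAB = ["a", "b", "c"]

def score(seq):
    # Invented log-probabilities: the "model" prefers 'a' at every position.
    prefs = {"a": -0.1, "b": -1.0, "c": -2.0}
    return sum(prefs[w] for w in seq)

# FSA accepting any sequence that mentions 'c' at least once:
# state 0 = 'c' not yet seen, state 1 = 'c' seen (accepting).
S, s0, F = [0, 1], 0, {1}

def delta(s, w):
    return 1 if (s == 1 or w == "c") else s

def cbs(b, T):
    beams = {s: [()] if s == s0 else [] for s in S}   # one beam per state
    for _ in range(T):
        ext = {s: [] for s in S}
        for sp in S:                                   # extend every beam...
            for y in beams[sp]:
                for w in VOCAB:
                    ext[delta(sp, w)].append(y + (w,)) # ...routed through delta
        for s in S:                                    # keep b best per state
            beams[s] = sorted(ext[s], key=score, reverse=True)[:b]
    accepted = [y for s in F for y in beams[s]]        # only accepting states
    return max(accepted, key=score)

best = cbs(b=2, T=3)   # best length-3 sequence that contains 'c'
```

An unconstrained beam search would return ('a', 'a', 'a') here; the per-state beams force the result to pass through the accepting state, so the returned sequence always mentions 'c'.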
We also note that other variations of constrained beam search decoding\nhave been proposed [43\u201345]; we leave it to future work to determine if they could be used here.\n\nOnline version The PS3 training algorithm, as presented so far, is inherently of\ufb02ine. It requires\nmultiple iterations through the training data, which can become impractical with large models and\ndatasets. However, our approach can be adapted to an online implementation. For example, when\ntraining neural networks, Steps 1 and 2 can be performed for each minibatch, such that Step 1\nestimates the complete data for the current minibatch Y (cid:48) \u2282 Y , and Step 2 performs a gradient update\nbased on Y (cid:48). In terms of implementation, Steps 1 and 2 can be implemented in separate networks\nwith tied weights, or in a single network by backpropagating through the resulting search tree in the\nmanner of Wiseman and Rush [46]. In our GPU-based implementation, we use separate networks\nwith tied weights. This is more memory ef\ufb01cient when the number of beams b and the number of\nstates |S| is large, because performing the backward pass in the smaller Step 2 network means that it\nis not necessary to maintain the full unrolled history of the search tree in memory.\n\nComputational complexity Compared to training on complete sequence data, PS3 performs addi-\ntional computation to \ufb01nd a high-probability complete sequence for each partial sequence speci\ufb01ed\nby an FSA. Because constrained beam search maintains a beam of b sequences for each FSA state,\nthis cost is given by |S| \u00b7 b \u00b7 \u03b3, where |S| is the number of FSA states, b is the beam size parameter,\nand \u03b3 is the computational cost of a single forward pass through an unrolled recurrent neural network\n(e.g., the cost of decoding a single sequence). 
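The two-step loop of Section 3.2 can be instantiated end-to-end on a deliberately tiny "sequence model": a unigram distribution θ over words, with each position drawn independently. All data and constraints below are invented; a disjunctive constraint on a word plays the role of a one-transition FSA, and exact argmax completion stands in for constrained beam search.

```python
from collections import Counter

# Complete two-word training sequences (invented toy data).
complete = [["a", "b"], ["a", "a"], ["b", "b"]]
# Partially-specified sequences: first word observed, second word known only
# to lie in a disjunctive set (the FSA-constrained slot of Step 1).
partial = [("a", {"b", "c"}), ("c", {"a", "c"})]

def mle(seqs):
    """Maximum-likelihood unigram estimate (the argmax of Step 2)."""
    counts = Counter(w for s in seqs for w in s)
    total = sum(counts.values())
    return {w: counts[w] / total for w in counts}

theta = mle(complete)                      # initialize on complete data only
for _ in range(3):                         # PS3 iterations
    # Step 1: most probable completion consistent with each constraint.
    done = [[w0, max(allowed, key=lambda w: theta.get(w, 0.0))]
            for w0, allowed in partial]
    # Step 2: re-estimate parameters on complete + completed data.
    theta = mle(complete + done)
```

After the first iteration the estimate is stable: the completions pull probability mass onto 'c' (which never appears in the complete data), giving θ = {a: 0.5, b: 0.4, c: 0.1}. In the online variant described above, the same two steps would run once per minibatch with an RNN in place of the unigram model.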
Although the computational cost of training increases\nlinearly with the number of FSA states, for any particular application FSA construction is a modeling\nchoice and there are many existing FSA compression and state reduction methods available.\n\n4 Application to image captioning\n\nIn this section, we describe how image captioning models can be trained on object annotations and\nimage tags by interpreting these annotations as partially-speci\ufb01ed caption sequences.\n\n5\n\n\fCaptioning model For image captioning experiments we use the open source bottom-up and top-\ndown attention captioning model [10], which we refer to as Up-Down. This model belongs to the\nclass of \u2018encoder-decoder\u2019 neural architectures and recently achieved state of the art results on the\nCOCO test server [6]. The input to the model is an image, I. The encoder part of the model consists\nof a Faster R-CNN object detector [47] based on the ResNet-101 CNN [48] that has been pre-trained\non the Visual Genome dataset [49]. Following the methodology in Anderson et al. [10], the image I\nis encoded as a set of image feature vectors, V = {v1, . . . , vk}, vi \u2208 RD, where each vector vi is\nassociated with an image bounding box. The decoder part of the model consists of a 2-layer Long\nShort-Term Memory (LSTM) network [50] combined with a soft visual attention mechanism [2]. At\neach timestep t during decoding, the decoder takes as input an encoding of the previously generated\nword given by We\u03a0t, where We \u2208 RM\u00d7|\u03a3| is a word embedding matrix for a vocabulary \u03a3 with\nembedding size M, and \u03a0t is one-hot encoding of the input word at timestep t. The model outputs a\nconditional distribution over the next word output given by p(yt | y1:t\u22121) = softmax (Wpht + bp),\nwhere ht \u2208 RN is the LSTM output and Wp \u2208 R|\u03a3|\u00d7N and bp \u2208 R|\u03a3| are learned weights and\nbiases. 
biases. The decoder represents the distribution over complete output sequences using Equation 1.

Finite state automaton construction To train image captioning models on datasets of object detections and labeled images, we construct finite state automata as follows. At each training iteration we select three labels at random from the labels assigned to each image. Each of the three selected labels is mapped to a disjunctive set Di containing every word in the vocabulary Σ that shares the same word stem. For example, the label bike maps to { bike, bikes, biked, biking }. This gives the captioning model the freedom to choose word forms. As the selected image labels may include redundant synonyms such as bike and bicycle, we only enforce that the generated caption mentions at least two of the three selected image labels. We therefore construct a finite state automaton that accepts strings that contain at least one word from at least two of the disjunctive sets. As illustrated in Figure 2(c), the resulting FSA contains eight states (although the four accepting states could be collapsed into one). In initial experiments we investigated several variations of this simple construction approach (e.g., randomly selecting two or four labels, or requiring more or fewer of the selected labels to be mentioned in the caption). These alternatives performed slightly worse than the approach described above. However, we leave a detailed investigation of more sophisticated methods for constructing finite state automata encoding observed partial information to future work.

Out-of-vocabulary words One practical consideration when training image captioning models on datasets of object detections and labeled images is the presence of out-of-vocabulary words. The constrained decoding in Step 1 can only produce fluent sentences if the model can leverage some side information about the out-of-vocabulary words. 
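The label FSA described above can be sketched compactly by letting each state record which of the three disjunctive sets has been matched so far: the subsets of {D1, D2, D3} give the eight states, and the four states with two or more matches are accepting. The stem sets below are invented examples, not taken from the paper's vocabulary.

```python
# Sketch of the label FSA: states are subsets of matched disjunctive sets;
# accept when at least two of the three sets have been matched.
# Invented example stem sets standing in for D1, D2, D3.
D = [{"bike", "bikes", "biked", "biking"},
     {"man", "men"},
     {"road", "roads"}]

def accepts(caption):
    state = frozenset()                  # which of D1-D3 matched so far
    for word in caption.lower().split():
        for i, Di in enumerate(D):
            if word in Di:
                state = state | {i}      # matched sets only accumulate
    # 8 reachable states (subsets of {0,1,2}); the 4 with |state| >= 2 accept.
    return len(state) >= 2

assert accepts("a man biking down the road")   # matches all three sets
assert accepts("two men on a road")            # matches two sets
assert not accepts("a man with a hat")         # matches only one set
```

Because matched sets only accumulate, every non-matching word is a self-loop (the 'default transitions' of Figure 2), which is why the automaton needs no explicit reject state.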
To address this problem, we take the same approach as Anderson et al. [13], adding pre-trained word embeddings to both the input and output layers of the decoder. Specifically, we initialize We with pretrained word embeddings, and add an additional output layer such that vt = tanh(Wp ht + bp) and p(yt | y1:t−1) = softmax(We^T vt). For the word embeddings, we concatenate GloVe [37] and dependency-based [51] embeddings, as we find that the resulting combination of semantic and functional context improves the fluency of the constrained captions compared to using either embedding on its own.

Implementation details In all experiments we initialize the model by training on the available image-caption dataset following the cross-entropy loss training scheme in the Up-Down paper [10], and keeping pre-trained word embeddings fixed. When training on image labels, we use the online version of our proposed training algorithm, constructing each minibatch of 100 with an equal number of complete and partially-specified training examples. We use SGD with an initial learning rate of 0.001, decayed to zero over 5K iterations, with a lower learning rate for the pre-trained word embeddings. In beam search and constrained beam search decoding we use a beam size of 5. Training (after initialization) takes around 8 hours using two Titan X GPUs.

5 Experiments

5.1 COCO novel object captioning

Dataset splits To evaluate our proposed approach, we use the COCO 2014 captions dataset [52] containing 83K training images and 41K validation images, each labeled with five human-annotated

Table 1: Impact of training and decoding with image labels on COCO novel object captioning validation set scores. All experiments use the same finite state automaton construction. On out-of-domain images, imposing label constraints during training using PS3 always improves the model (row 3 vs. 1, 4 vs. 2, 6 vs. 
5), and constrained beam search (CBS) decoding is no longer necessary (row 4 vs. 3). The model trained using PS3 and decoded with standard beam search (row 3) is closest to the performance of the model trained with the full set of image captions (row 7).

|   | Captions | PS3 labels | CBS labels   | Out-of-Domain SPICE | METEOR | CIDEr | F1   | In-Domain SPICE | METEOR | CIDEr |
|---|----------|------------|--------------|---------------------|--------|-------|------|-----------------|--------|-------|
| 1 | partial  | –          | –            | 14.4                | 22.1   | 69.5  | 0.0  | 19.9            | 26.5   | 108.6 |
| 2 | partial  | –          | predicted    | 15.9                | 23.1   | 74.8  | 26.9 | 19.7            | 26.2   | 102.4 |
| 3 | partial  | yes        | –            | 18.3                | 25.5   | 94.3  | 63.4 | 18.9            | 25.9   | 101.2 |
| 4 | partial  | yes        | predicted    | 18.2                | 25.2   | 92.5  | 62.4 | 19.1            | 25.9   | 99.5  |
| 5 | partial  | –          | ground-truth | 18.0                | 24.5   | 82.5  | 30.4 | 22.3            | 27.9   | 109.7 |
| 6 | partial  | yes        | ground-truth | 20.1                | 26.4   | 95.5  | 65.0 | 21.7            | 27.5   | 106.6 |
| 7 | full     | –          | –            | 20.1                | 27.0   | 111.5 | 69.0 | 20.0            | 26.7   | 109.5 |

full = full caption training set, partial = impoverished caption training set, predicted = constrained beam search (CBS) decoding with predicted labels, ground-truth = CBS decoding with ground-truth labels

Table 2: Performance on the COCO novel object captioning test set. '+ CBS' indicates that a model was decoded using constrained beam search [13] to force the inclusion of image labels predicted by an external model. 
On standard caption metrics, our generic training algorithm (PS3) applied to the Up-Down [10] model outperforms all prior work.

| Model           | CNN     | Out-of-Domain F1 | SPICE | METEOR | CIDEr | In-Domain SPICE | METEOR | CIDEr |
|-----------------|---------|------------------|-------|--------|-------|-----------------|--------|-------|
| DCC [21]        | VGG-16  | 39.8             | 13.4  | 21.0   | 59.1  | 15.9            | 23.0   | 77.2  |
| NOC [22]        | VGG-16  | 48.8             | –     | 21.3   | –     | –               | –      | –     |
| C-LSTM [23]     | VGG-16  | 55.7             | –     | 23.0   | –     | –               | –      | –     |
| LRCN + CBS [13] | VGG-16  | 54.0             | 15.9  | 23.3   | 77.9  | 18.0            | 24.5   | 86.3  |
| LRCN + CBS [13] | Res-50  | 53.3             | 16.4  | 23.6   | 77.6  | 18.4            | 24.9   | 88.0  |
| NBT [24]        | VGG-16  | 48.5             | 15.7  | 22.8   | 77.0  | 17.5            | 24.3   | 87.4  |
| NBT + CBS [24]  | Res-101 | 70.3             | 17.4  | 24.1   | 86.0  | 18.0            | 25.0   | 92.1  |
| PS3 (ours)      | Res-101 | 63.0             | 17.9  | 25.4   | 94.5  | 19.0            | 25.9   | 101.1 |

captions. We use the splits proposed by Hendricks et al. [21] for novel object captioning, in which all images with captions that mention one of eight selected objects (including synonyms and plural forms) are removed from the caption training set, which is reduced to 70K images. The original COCO validation set is split 50% for validation and 50% for testing. As such, models are required to caption images containing objects that are not present in the available image-caption training data. For analysis, we further divide the test and validation sets into their in-domain and out-of-domain components. Any test or validation image with a reference caption that mentions a held-out object is considered to be out-of-domain. The held-out object classes selected by Hendricks et al. [21] are BOTTLE, BUS, COUCH, MICROWAVE, PIZZA, RACKET, SUITCASE, and ZEBRA.

Image labels As with zero-shot learning [53], novel object captioning requires auxiliary information in order to successfully caption images containing novel objects. In the experimental procedure proposed by Hendricks et al. 
[21] and followed by others [13, 22, 23], this auxiliary information is provided in the form of image labels corresponding to the 471 most common adjective, verb and noun base word forms extracted from the held-out training captions. Because these labels are extracted from captions, there are no false positives, i.e., all of the image labels are salient to captioning. However, the task is still challenging as the labels are pooled across five captions per image, with the number of labels per image ranging from 1 to 27 with a mean of 12.

Evaluation   To evaluate caption quality, we use SPICE [54], CIDEr [55] and METEOR [56]. We also report the F1 metric for evaluating mentions of the held-out objects. The ground truth for an object mention is considered to be positive if the held-out object is mentioned in any reference caption. For consistency with previous work, out-of-domain scores are macro-averaged across the held-out classes, and CIDEr document frequency statistics are determined across the entire test set.

Figure 3: Examples of generated captions for images containing novel objects. The baseline Up-Down [10] captioning model performs poorly on images containing object classes not seen in the available image-caption training data. Incorporating image labels for these object classes into training using PS3 allows the same model to produce fluent captions for the novel objects. The last two examples may be considered to be failure cases (because the novel object classes, suitcase and bottle, are not mentioned).
- zebra. Baseline: "A close up of a giraffe with its head." Ours: "A couple of zebra standing next to each other."
- bus. Baseline: "A food truck parked on the side of a road." Ours: "A white bus driving down a city street."
- couch. Baseline: "A living room filled with lots of furniture." Ours: "A brown couch sitting in a living room."
- microwave. Baseline: "A picture of an oven in a kitchen." Ours: "A microwave sitting on top of a counter."
- pizza. Baseline: "A collage of four pictures of food." Ours: "A set of pictures showing a slice of pizza."
- racket. Baseline: "A young girl is standing in the tennis court." Ours: "A little girl holding a tennis racket."
- suitcase. Baseline: "A group of people walking down a street." Ours: "A group of people walking down a city street."
- bottle. Baseline: "A woman in the kitchen with a toothbrush in her hand." Ours: "A woman wearing a blue tie holding a yellow toothbrush."

Results   In Table 1 we show validation set results for the Up-Down model with various combinations of PS3 training and constrained beam search decoding (top panel), as well as performance upper bounds using ground-truth data (bottom panel). For constrained beam search decoding, image label predictions are generated by a linear mapping from the mean-pooled image feature (1/k) Σ_{i=1}^{k} v_i to image label scores, which is trained on the entire training set. The results demonstrate that, on out-of-domain images, imposing the caption constraints during training using PS3 helps more than imposing the constraints during decoding. Furthermore, the model trained with PS3 has assimilated all the information available from the external image labeler, such that using constrained beam search during decoding provides no additional benefit (row 3 vs. row 4). Overall, the model trained on image labels with PS3 (row 3) is closer in performance to the model trained with all captions (row 7) than it is to the baseline model (row 1). Evaluating our model (row 3) on the test set, we achieve state of the art results on the COCO novel object captioning task, as illustrated in Table 2. In Figure 3 we provide examples of generated captions, including failure cases.
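The image label predictions that feed constrained beam search are produced by a linear mapping from the mean-pooled image feature to label scores. A minimal sketch, assuming numpy; the 471-label vocabulary matches Section 5.1, but the feature sizes, top-k cutoff, function name, and random weights are illustrative assumptions, not values from the paper:

```python
import numpy as np

def predict_labels(V, W, b, vocab, top_k=3):
    """Predict image labels from a set of region features.

    V: (k, d) array of image region features v_1 .. v_k.
    W: (n_labels, d) weights of the linear label classifier.
    b: (n_labels,) bias vector.
    Returns the top_k label strings, e.g. for use as decoding constraints.
    """
    v_mean = V.mean(axis=0)        # mean-pooled feature (1/k) * sum_i v_i
    scores = W @ v_mean + b        # linear mapping to image label scores
    top = np.argsort(scores)[::-1][:top_k]
    return [vocab[i] for i in top]

# Toy usage with random weights (illustrative only; the paper trains
# the linear mapping on the entire caption training set):
rng = np.random.default_rng(0)
V = rng.normal(size=(36, 2048))    # hypothetical: 36 regions, 2048-d features
W = rng.normal(size=(471, 2048))   # 471 image labels, as in Sec. 5.1
b = np.zeros(471)
vocab = ["label_%d" % i for i in range(471)]
labels = predict_labels(V, W, b, vocab)
```

The predicted labels would then be handed to the constrained beam search decoder [13] as constraints that generated captions must satisfy.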
In Figure 4 we visualize attention in the model (suggesting that image label supervision can successfully train a visual attention mechanism to localize new objects).

Figure 4: To further explore the impact of training using PS3, we visualize attention in the Up-Down [10] model. As shown in this example, using only image label supervision (i.e., without caption supervision) the model still learns to ground novel object classes (such as racket) in the image. Generated caption: "A woman holding a tennis racket on a court."

5.2 Preliminary experiments on Open Images

Our primary motivation in this work is to extend the visual vocabulary of existing captioning models by making large object detection datasets available for training. Therefore, as a proof of concept, we train a captioning model simultaneously on COCO Captions [6] and object annotation labels for 25 additional animal classes from the Open Images V4 dataset [14]. In Figure 5 we provide some examples of the generated captions.

Figure 5: Preliminary experiments on Open Images. As expected, the baseline Up-Down [10] model trained on COCO performs poorly on novel object classes from the Open Images dataset. Incorporating image labels from 25 selected classes using PS3 leads to qualitative improvements. The last two examples are failure cases (but no worse than the baseline).
- tiger. Baseline: "A zebra is laying down in the grass." Ours: "A tiger that is sitting in the grass."
- monkey. Baseline: "A black elephant laying on top of a wooden surface." Ours: "A monkey that is sitting on the ground."
- rhino. Baseline: "A man taking a picture of an old car." Ours: "A man sitting in a car looking at an elephant."
- rabbit. Baseline: "A cat that is laying on the grass." Ours: "A squirrel that is sitting in the grass."
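The held-out object F1 described in Section 5.1 scores whether each generated caption mentions the object when any reference caption does, macro-averaged over the held-out classes. A simplified sketch of this metric; plain substring matching stands in for the synonym and plural-form matching used in the actual evaluation, and the function names and toy data are our own:

```python
def mention_f1(generated, references, obj):
    """F1 for mentions of one held-out object class.

    generated: list of generated captions, one per image.
    references: list of reference-caption lists, one per image.
    obj: the held-out object word, e.g. "zebra".
    """
    tp = fp = fn = 0
    for gen, refs in zip(generated, references):
        pred = obj in gen.lower()
        # Ground truth is positive if any reference caption mentions the object.
        gold = any(obj in r.lower() for r in refs)
        tp += pred and gold
        fp += pred and not gold
        fn += gold and not pred
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

def macro_f1(generated, references, held_out):
    # Out-of-domain scores are macro-averaged across the held-out classes.
    return sum(mention_f1(generated, references, o) for o in held_out) / len(held_out)

gen = ["a zebra standing in a field", "a bus on the street"]
refs = [["a zebra grazing"], ["a food truck parked"]]
print(macro_f1(gen, refs, ["zebra", "bus"]))  # → 0.5
```

Note that substring matching is only a stand-in: a real implementation would match against the word lists (synonyms and plurals) defined by Hendricks et al. [21].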
We also evaluate the jointly trained model on the COCO 'Karpathy' val split [27], achieving SPICE, METEOR and CIDEr scores of 18.8, 25.7 and 103.5, respectively, versus 20.1, 26.9 and 112.3 for the model trained exclusively on COCO.

6 Conclusion

We propose a novel algorithm for training sequence models on partially-specified data represented by finite state automata. Applying this approach to image captioning, we demonstrate that a generic image captioning model can learn new visual concepts from labeled images, achieving state of the art results on the COCO novel object captioning splits. We further show that we can train the model to describe new visual concepts from the Open Images dataset while maintaining competitive COCO evaluation scores. Future work could investigate training captioning models on finite state automata constructed from scene graph and visual relationship annotations, which are also available at large scale [14, 49].

Acknowledgments

This research was supported by a Google award through the Natural Language Understanding Focused Program, CRP 8201800363 from Data61/CSIRO, and under the Australian Research Council's Discovery Projects funding scheme (project number DP160102156). We also thank the anonymous reviewers for their valuable comments that helped to improve the paper.

References

[1] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.

[2] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.

[3] Hao Fang, Saurabh Gupta, Forrest N. Iandola, Rupesh Srivastava, Li Deng, Piotr Dollar, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, and Geoffrey Zweig. From captions to visual concepts and back.
In CVPR, 2015.

[4] Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853–899, 2013.

[5] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.

[6] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO Captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.

[7] Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In CVPR, 2017.

[8] Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR, 2017.

[9] Zhilin Yang, Ye Yuan, Yuexin Wu, Ruslan Salakhutdinov, and William W. Cohen. Review networks for caption generation. In NIPS, 2016.

[10] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.

[11] Kenneth Tran, Xiaodong He, Lei Zhang, Jian Sun, Cornelia Carapcea, Chris Thrasher, Chris Buehler, and Chris Sienkiewicz. Rich image captioning in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2016.

[12] Haley MacLeod, Cynthia L. Bennett, Meredith Ringel Morris, and Edward Cutrell. Understanding blind people's experiences with computer-generated captions of social media images. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems.
ACM, 2017.

[13] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Guided open vocabulary image captioning with constrained beam search. In EMNLP, 2017.

[14] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, Serge Belongie, Victor Gomes, Abhinav Gupta, Chen Sun, Gal Chechik, David Cai, Zheyun Feng, Dhyanesh Narayanan, and Kevin Murphy. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages, 2017.

[15] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 2015.

[16] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.

[17] Dim P. Papadopoulos, Jasper R. R. Uijlings, Frank Keller, and Vittorio Ferrari. Extreme clicking for efficient object annotation. In ICCV, 2017.

[18] Dim P. Papadopoulos, Jasper R. R. Uijlings, Frank Keller, and Vittorio Ferrari. We don't need no bounding-boxes: Training object class detectors using only human verification. In CVPR, 2016.

[19] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 1977.

[20] Geoffrey J. McLachlan and Thriyambakam Krishnan. The EM Algorithm and Extensions. Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ, 2nd edition, 2008.
ISBN 978-0-471-20170-0.

[21] Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, and Trevor Darrell. Deep compositional captioning: Describing novel object categories without paired training data. In CVPR, 2016.

[22] Subhashini Venugopalan, Lisa Anne Hendricks, Marcus Rohrbach, Raymond J. Mooney, Trevor Darrell, and Kate Saenko. Captioning images with diverse objects. In CVPR, 2017.

[23] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Incorporating copying mechanism in image captioning for learning novel objects. In CVPR, 2017.

[24] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Neural baby talk. In CVPR, 2018.

[25] Jeffrey Donahue, Lisa A. Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.

[26] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). In ICLR, 2015.

[27] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.

[28] Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Li Deng, Xiaodong He, Geoffrey Zweig, and Margaret Mitchell. Language models for image captioning: The quirks and what works. arXiv preprint arXiv:1505.01809, 2015.

[29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.

[30] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

[31] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.

[32] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate.
In ICLR, 2015.

[33] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Incorporating copying mechanism in image captioning for learning novel objects. In CVPR, 2017.

[34] Andrew M. Dai and Quoc V. Le. Semi-supervised sequence learning. In NIPS, 2015.

[35] Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning distributed representations of sentences from unlabelled data. arXiv preprint arXiv:1602.03483, 2016.

[36] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[37] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.

[38] Shahla Parveen and Phil Green. Speech recognition with missing data using recurrent neural nets. In NIPS, 2002.

[39] Zachary C. Lipton, David C. Kale, and Randall Wetzel. Modeling missing data in clinical time series with RNNs. In Machine Learning for Healthcare, 2016.

[40] Zoubin Ghahramani and Michael I. Jordan. Supervised learning from incomplete data via an EM approach. In NIPS, 1994.

[41] Michael Sipser. Introduction to the Theory of Computation. Cengage Learning, 3rd edition, 2012.

[42] Philipp Koehn. Statistical Machine Translation. Cambridge University Press, New York, NY, USA, 1st edition, 2010. ISBN 0521874157, 9780521874151.

[43] Chris Hokamp and Qun Liu. Lexically constrained decoding for sequence generation using grid beam search. In ACL, 2017.

[44] Matt Post and David Vilar. Fast lexically constrained decoding with dynamic beam allocation for neural machine translation. arXiv preprint arXiv:1804.06609, 2018.

[45] Kyle Richardson, Jonathan Berant, and Jonas Kuhn. Polyglot semantic parsing in APIs. In NAACL, 2018.

[46] Sam Wiseman and Alexander M. Rush. Sequence-to-sequence learning as beam-search optimization.
In EMNLP, 2016.

[47] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.

[48] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[49] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael Bernstein, and Li Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016.

[50] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.

[51] Omer Levy and Yoav Goldberg. Dependency-based word embeddings. In ACL, 2014.

[52] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.

[53] Yongqin Xian, Bernt Schiele, and Zeynep Akata. Zero-shot learning: The good, the bad and the ugly. In CVPR, 2017.

[54] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic propositional image caption evaluation. In ECCV, 2016.

[55] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In CVPR, 2015.

[56] Alon Lavie and Abhaya Agarwal. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL): Second Workshop on Statistical Machine Translation, 2007.