{"title": "Variational Structured Semantic Inference for Diverse Image Captioning", "book": "Advances in Neural Information Processing Systems", "page_first": 1931, "page_last": 1941, "abstract": "Despite the exciting progress in image captioning, generating diverse captions for a given image remains an open problem. Existing methods typically apply generative models such as the Variational Auto-Encoder to diversify the captions, which, however, neglect two key factors of diverse expression, i.e., lexical diversity and syntactic diversity. To model these two inherent diversities in image captioning, we propose a Variational Structured Semantic Inferring model (termed VSSI-cap) executed in a novel structured encoder-inferer-decoder schema. VSSI-cap's main innovation is a novel structure, i.e., the Variational Multi-modal Inferring tree (termed VarMI-tree). In particular, conditioned on the visual-textual features from the encoder, the VarMI-tree models the lexical and syntactic diversities by inferring their latent variables (with variations) in an approximate posterior inference guided by a visual semantic prior. Then, a reconstruction loss and the posterior-prior KL-divergence are jointly estimated to optimize the VSSI-cap model. Finally, diverse captions are generated from the visual features and the latent variables of this structured encoder-inferer-decoder model. 
Experiments on the benchmark dataset show that the proposed VSSI-cap achieves significant improvements over the state-of-the-art methods.", "full_text": "Variational Structured Semantic Inference for Diverse Image Captioning

Fuhai Chen1, Rongrong Ji1,2*, Jiayi Ji1, Xiaoshuai Sun1, Baochang Zhang3, Xuri Ge1, Yongjian Wu4, Feiyue Huang4, Yan Wang5
1Department of Artificial Intelligence, School of Informatics, Xiamen University, 2Peng Cheng Lab, 3Beihang University, 4Tencent Youtu Lab, 5Pinterest
{cfh3c.xmu,jjyxmu,xurigexmu}@gmail.com, {rrji,xssun}@xmu.edu.cn, bczhang@buaa.edu.cn, {littlekenwu,garyhuang}@tencent.com, yanw@pinterest.com

Abstract

Despite the exciting progress in image captioning, generating diverse captions for a given image remains an open problem. Existing methods typically apply generative models such as the Variational Auto-Encoder to diversify the captions, which, however, neglect two key factors of diverse expression, i.e., lexical diversity and syntactic diversity. To model these two inherent diversities in image captioning, we propose a Variational Structured Semantic Inferring model (termed VSSI-cap) executed in a novel structured encoder-inferer-decoder schema. VSSI-cap's main innovation is a novel structure, i.e., the Variational Multi-modal Inferring tree (termed VarMI-tree). In particular, conditioned on the visual-textual features from the encoder, the VarMI-tree models the lexical and syntactic diversities by inferring their latent variables (with variations) in an approximate posterior inference guided by a visual semantic prior. Then, a reconstruction loss and the posterior-prior KL-divergence are jointly estimated to optimize the VSSI-cap model. Finally, diverse captions are generated from the visual features and the latent variables of this structured encoder-inferer-decoder model. 
Experiments on the benchmark dataset show that the proposed VSSI-cap achieves significant improvements over the state-of-the-art methods.

1 Introduction
Image captioning has recently attracted extensive research attention with broad application prospects. Most state-of-the-art image captioning models adopt an encoder-decoder architecture [1, 2, 3], which encodes the image into a feature representation via a Convolutional Neural Network (CNN) and then decodes the feature into a caption via a Recurrent Neural Network with Long Short-Term Memory units (LSTM). Despite the exciting progress, one common defect is that the generated captions are semantically synonymous and syntactically similar, which goes against the inherent diversity delivered by the image, i.e., "A picture is worth a thousand words". Nevertheless, generating diverse captions from a given image remains an open problem. As shown in Fig. 1 (Left-Top), it is quite intuitive for human beings to derive heterogeneous understandings, while traditional models typically tend to generate homogeneous sentences due to the limited variation in the maximum likelihood objective [4].
Several recent works have investigated diverse image captioning [5, 6, 7, 8, 9], typically employing a Generative Adversarial Network (GAN) or a Variational Auto-Encoder (VAE) as the generative model. For example, [5] designed an adversarial model trained with an approximate sampler to implicitly match the generated distribution to the human captions. For another instance, [8] proposed a conditional VAE based captioning model guided by an object-wise prior, as roughly

*Corresponding author

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Illustration of diverse image captioning. 
Left: Captions generated by a traditional image captioning model (Left-Top), a state-of-the-art generative model (Left-Middle), and our scheme that explicitly models lexical and syntactic diversities (Left-Bottom) for diverse image captioning. Right: Captions with higher diversity are generated when the lexical (light blue) and syntactic (purple) diversities are considered.

shown in Fig. 1 (Left-Middle). However, all these methods treated diverse image captioning as a black box without explicitly modeling the key factors that diversify the expression, i.e., the lexical and syntactic diversities, as revealed in natural language research [10, 11, 12], which in principle involves identifying content entities and then expressing their relationships. Fig. 1 (Left-Bottom and Right) shows an example of the lexical and syntactic diversities, both of which should be taken into account for generating diverse image captions.
In this paper, we aim at explicitly modeling the lexical and syntactic diversities of the visual content towards diversified image caption generation. To this end, we tackle two fundamental challenges, i.e., diversity modeling and diversity embedding. For diversity modeling, we infer the lexical and syntactic variables from the visual content by leveraging the visual parsing tree (VP-tree) [13, 14, 15, 16], which predicts the probability distributions of the lexical and syntactic categories to weight the latent variables in variational inferences. 
For diversity embedding, we advance the commonly-used encoder-decoder scheme into a new structured encoder-inferer-decoder scheme, where the aforementioned variational inference is treated as an inferer and its outputs, i.e., the lexical and syntactic latent variables (with variations), are sampled together with the visual features to feed an LSTM-based caption generator.
In particular, we propose a novel Variational Structured Semantic Inferring model for diverse image captioning, termed VSSI-cap as illustrated in Fig. 2, which is deployed over VAE2 to model and embed the lexical and syntactic diversities. In general, towards diversity modeling, such diversities are inferred in the designed variational multi-modal inferring tree (termed VarMI-tree). Towards diversity embedding, such diversities are integrated into diverse image captioning in a new structured encoder-inferer-decoder scheme. In particular, the proposed model contains three components: 1) encoder: given an image and its corresponding caption, the visual and textual features are extracted by a CNN and a word embedding model, respectively. 2) inferer: inspired by the recent work in visual semantic parsing [13], a VarMI-tree is proposed to infer the latent variables with variations for the lexical and syntactic diversities. 3) decoder: the visual feature and the inferred lexical and syntactic variables (from posterior/prior inference) are decoded to output the caption by using an LSTM.
The contributions of this paper are as follows: 1) We are the first to explicitly model diverse image captioning based on the lexical and syntactic diversities. We address two key issues in diverse captioning, i.e., diversity modeling and diversity embedding. 2) For diversity modeling, we propose a novel variational multi-modal inferring tree (VarMI-tree) to model the lexical and syntactic diversities. 
3) For diversity embedding, we propose a structured encoder-inferer-decoder scheme which explicitly integrates the lexical and syntactic diversities into caption generation. 4) The proposed VSSI-cap beats the state-of-the-art methods [5, 8] on the MSCOCO benchmark dataset in terms of both accuracy metrics and diversity metrics.
2 Preliminary
Image Captioning. We adopt an encoder-decoder architecture as the basic image captioning model, where a CNN is employed to encode an image I into a deep visual feature v and an LSTM is used to decode this visual feature into a caption S. Many state-of-the-art methods [2, 3] adopt the maximum likelihood principle to train the models by using the image-caption pair set

2Compared to other generative models, VAE can represent richer latent variables and can also be trained more easily.

Figure 2: Overview of the proposed VSSI-cap model for diverse image captioning, which consists of encoder, inferer, and decoder. Given an image and its corresponding caption during training, the visual feature v and the textual feature e are extracted by a CNN and a word embedding model, respectively, in the encoder (briefly described in Sec. 3.2). In the inferer (Sec. 
3.2), to represent the lexical/syntactic diversity, a VarMI-tree is designed to infer the latent lexical/syntactic variable z^(ℓ)/z^(s) upon an additive Gaussian distribution in each node, where the means µ_{1:K} and the square deviations (sds) σ_{1:K} over different lexical/syntactic components are parameterized upon the node feature h, and subsequently weighted by the corresponding probability distributions c_{1:K} from the VP-tree (Sec. 3.1) to obtain the additive Gaussian parameters µ and σ. In the decoder (Sec. 3.3), ~z is sampled from the posterior inference and is used for training, while ~z' is sampled from the prior inference (similar to the posterior inference but with µ_{1:K} and σ_{1:K} initialized randomly, detailed in Sec. 3.2) and is used to generate captions. Finally, v and ~z/~z' are fed into the LSTM for the sequential caption outputs.

D_p = {I^(i), S^(i) = {S_t^(i)}_{t=0}^{T^(i)}}_{i=0}^{N_p}, where N_p and T denote the pair number and the caption length, respectively. The corresponding objective function can be formulated as follows:

log P(S|I) = (1/N_p) sum_{i=0}^{N_p} sum_{t=0}^{T_i} log p( S_t^(i) | v^(i), S_{0:t-1}^(i) ).   (1)

However, the above schemes are unsuitable for generating multiple diverse caption candidates due to the certainty of the encoding in Eq. 1. Therefore, generative models, such as GAN and VAE, are typically exploited to handle diverse image captioning [5, 6, 7, 8]. Other related topics include: personalized expression [17, 18], stylistic description [19, 20], online context-aware heuristic search [21, 22], and word-specific discriminative captioning [15], etc.

Variational Auto-Encoder (VAE). We briefly present the variational auto-encoder (VAE) [23, 24] and its conditional variant [25, 26], which serves as the fundamental framework of the proposed structured encoder-inferer-decoder scheme. 
Given an observed variable x, VAEs aim at modeling the data likelihood p(x) based on the assumption that x is generated from a latent variable z, i.e., the decoder p(x|z), which is typically estimated via deep nets. Since the posterior inference p(z|x) is not computationally tractable, it is approximated with a posterior inference q(z|x) that is typically a diagonal Gaussian N(µ, diag(σ^2)), where the mean µ and the square deviation σ can be parameterized in deep nets and serve as the encoder3. Thus, the encoder/inferer and decoder can be optimized by maximizing the following lower bound:

L_VAE(θ, φ; x, c) = E_{q_φ(z|x,c)}[ log p_θ(x|z, c) ] − D_KL( q_φ(z|x, c) || p_θ(z|c) ) ≤ log p_θ(x),   (2)

where E and D_KL are the approximate expectation and the Kullback-Leibler (KL) divergence, respectively. c denotes the condition, which exists in the case of the conditional VAE (CVAE). φ and θ denote the parameters of the inferer and the decoder (e.g., LSTM), respectively. For diverse image captioning, it is straightforward to represent the visual feature and the caption with c and x, respectively, in a VAE model. However, the latent variable z in such a VAE model has a very general prior (a standard Gaussian), which does not consider any domain-specific knowledge. We argue
We argue\nthat it may waste model capacity, and one should consider the unique problem structures of image\ncaptioning instead of using the VAE as is.\n\n3\n\nchildcar seatin(cid:2246)(cid:2869)(cid:4666)(cid:3039)(cid:4667)(cid:3037)(cid:2246)(cid:3012)(cid:4666)(cid:2194)(cid:4667)(cid:4666)(cid:3039)(cid:4667)(cid:3037)(cid:28663)(cid:2252)(cid:4666)(cid:2194)(cid:4667)(cid:3037)a young child in the back car seat talks on a phoneLSTMDecoderGeneration: Sample on prior(cid:2246)(cid:4666)(cid:2194)(cid:4667)(cid:3037)(cid:1826)(cid:4666)(cid:3039)(cid:4667)(cid:3037)(cid:2252)(cid:2869)(cid:4666)(cid:3039)(cid:4667)(cid:3037)(cid:2252)(cid:3012)(cid:4666)(cid:2194)(cid:4667)(cid:4666)(cid:3039)(cid:4667)(cid:3037)(cid:1808)(cid:3037)(cid:2246)(cid:2869)(cid:4666)(cid:3046)(cid:4667)(cid:3037)(cid:2246)(cid:3012)(cid:4666)(cid:2201)(cid:4667)(cid:4666)(cid:3046)(cid:4667)(cid:3037)(cid:28663)(cid:2252)(cid:4666)(cid:2201)(cid:4667)(cid:3037)(cid:2246)(cid:4666)(cid:2201)(cid:4667)(cid:3037)(cid:1826)(cid:4666)(cid:2201)(cid:4667)(cid:3037)(cid:2252)(cid:2869)(cid:4666)(cid:3046)(cid:4667)(cid:3037)(cid:2252)(cid:3012)(cid:4666)(cid:2201)(cid:4667)(cid:4666)(cid:3046)(cid:4667)(cid:3037)(cid:3556)(cid:1826)(cid:4593)CNNSubject 1Object 1Subject 2Object 2Sub-relation 1Sub-relation 2Root-Relationtalk onNULLphoneNULLEncoderInferer(VariationalMulti-modal Inferring Tree: VarMI-tree)Lexical variableSyntactic variableplayinPREPVERBParsing & EmbeddingPosterior inference in the j-thnode(cid:1803)(cid:4666)(cid:2194)(cid:4667)(cid:3037)(cid:1803)(cid:4666)(cid:2201)(cid:4667)(cid:3037)(cid:1803)(cid:3012)(cid:4666)(cid:2194)(cid:4667)(cid:4666)(cid:3039)(cid:4667)(cid:3037)(cid:1803)(cid:2869)(cid:4666)(cid:3039)(cid:4667)(cid:3037)Probabilitydistributions from VP-treeTraining:Sample 
onposterior(cid:3556)(cid:1826)sequentialoutput(cid:1822)(cid:1805)SI(cid:1803)(cid:3012)(cid:4666)(cid:2201)(cid:4667)(cid:4666)(cid:3046)(cid:4667)(cid:3037)(cid:1803)(cid:2869)(cid:4666)(cid:3046)(cid:4667)(cid:3037)\fNotation\nI\nS\nv\nej\nhj\nz(\u2113)j/z(s)j\nc(\u2113)j/c(s)j\n(cid:22)(\u2113)j/(cid:22)(s)j\n(cid:27)(\u2113)j/(cid:27)(s)j\n/(cid:22)(s)j\n(cid:22)(\u2113)j\n/(cid:27)(s)j\n(cid:27)(\u2113)j\n\u03b8\n\u03d5(\u2113)/\u03d5(s)\n\u03c8\n\u2032\n\nk\n\nk\n\nk\n\nk\n\nTable 1: Main notations and their de\ufb01nitions.\n\nDe\ufb01nition\nan image\na caption\nthe visual feature\nthe j-node word embedding feature\nthe feature of the j-th node in VarMI-tree\nthe j-node lexical/syntactic latent variable\nthe j-node lexical(word\u2019s)/syntactic(POS\u2019s) probability distribution in VP-tree\nthe additive mean of the j-node lexical/syntactic posterior Gaussian distribution\nthe additive squ. dev. of the j-node lexical/syntactic posterior Gaussian distribution\nthe k-component mean of the j-node lexical/syntactic posterior Gaussian distribution\nthe k-component squ. dev. of the j-node lexical/syntactic posterior Gaussian distribution\nthe parameter set of the decoder\nthe lexical/syntactic parameter set of the inferer\nthe parameter set of VarMI-tree trunk\nthe mark for the prior\n\n3 The Proposed VSSI-Cap Model\nThe framework of the proposed VSSI-Cap model is illustrated in Fig. 2. Following Eq. 
2, the model is in principle optimized by maximizing the lower bound on the log-likelihood of p_θ(S) as below:

L(θ, φ^(ℓ), φ^(s), ψ; S, v, c^(ℓ), c^(s)) = E_{z^(ℓ)~q_{φ^(ℓ),ψ}, z^(s)~q_{φ^(s),ψ}}[ log p_θ(S | z^(ℓ), z^(s), v, c^(ℓ), c^(s)) ]
  − D_KL( q_{φ^(ℓ),ψ}(z^(ℓ) | S, v, c^(ℓ)) || p(z^(ℓ) | c^(ℓ)) ) − D_KL( q_{φ^(s),ψ}(z^(s) | S, v, c^(s)) || p(z^(s) | c^(s)) ),   (3)

which consists of two components, i.e., the approximate expectation E and the KL divergence D_KL. The former is maximized to reduce the reconstruction loss of the caption generation in the decoder as in Eq. 1, while the latter measures the difference between the distributions of the posterior q_{φ,ψ}(z|S, v, c) and the prior p(z|c) for the prior guidance (detailed in Sec. 3.3). Firstly, we define the variables and parameters as follows: "(ℓ)" and "(s)" mark the variables and parameters of the lexicon and the syntax, respectively. v, z, and c denote the visual feature of the image I, the lexical/syntactic latent variable, and the lexical (word's) / syntactic (POS's) probability distribution from the VP-tree (see Fig. 3), respectively. S denotes the caption, which is parsed and embedded into the textual feature e.4 θ is the parameter set of the decoder, while φ and ψ are the parameter sets of the lexical/syntactic posterior inference and the VarMI-tree trunk, respectively, in the inferer. Secondly, we introduce the posterior/prior (Sec. 3.2) based on the above definitions: we adopt an additive Gaussian distribution for the posterior/prior to infer the latent variables. 
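To make this additive Gaussian inference concrete before the formal derivation in Sec. 3.2, the following NumPy sketch builds one node's posterior from K component Gaussians weighted by a VP-tree probability distribution, draws a reparameterized sample, and evaluates a diagonal-Gaussian KL term. All shapes, names, and the number of components here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D_H, D_Z, K = 512, 150, 8  # node-feature dim, latent dim, component count (illustrative)

# Hypothetical per-component linear maps in the spirit of Eq. 7: mu_k(h), log sigma_k(h)^2.
W_mu = rng.normal(0.0, 0.01, (K, D_Z, D_H)); b_mu = np.zeros((K, D_Z))
W_sg = rng.normal(0.0, 0.01, (K, D_Z, D_H)); b_sg = np.zeros((K, D_Z))

def additive_gaussian(h, c):
    """Mix K component Gaussians into one diagonal Gaussian, weighted by c (c sums to 1)."""
    mu_k = np.einsum("kzh,h->kz", W_mu, h) + b_mu           # (K, D_Z) component means
    var_k = np.exp(np.einsum("kzh,h->kz", W_sg, h) + b_sg)  # (K, D_Z) component variances
    return (c[:, None] * mu_k).sum(0), (c[:, None] * var_k).sum(0)

def reparameterize(mu, var):
    """~z = mu + sigma * eps with eps ~ N(0, I), keeping sampling differentiable in (mu, var)."""
    return mu + np.sqrt(var) * rng.standard_normal(mu.shape)

def kl_diag(mu_q, var_q, mu_p, var_p):
    """Closed-form KL divergence between two diagonal Gaussians (the D_KL terms above)."""
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

h = rng.standard_normal(D_H)    # stands in for the node feature H_j(v, e)
c = rng.dirichlet(np.ones(K))   # stands in for a VP-tree probability distribution
mu, var = additive_gaussian(h, c)
z = reparameterize(mu, var)     # each fresh draw can decode into a different caption

# A randomly initialized "prior" in the same additive form, as used for generation.
mu_p, var_p = additive_gaussian(rng.standard_normal(D_H), c)
kl = kl_diag(mu, var, mu_p, var_p)
```

At training time the KL term pulls the posterior toward the prior; at generation time, sampling on the prior side supplies the variation that diversifies the captions.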
As shown in the middle of Fig. 2, the additive parameters, i.e., the mean µ/µ' and the square deviation (sd) σ/σ', are derived from multiple component parameters (the means and sds of multiple Gaussian distributions, corresponding to different word's and POS's components and weighted by the probability distributions). Thirdly, we describe the posterior/prior inference (Sec. 3.2): in the posterior inference, the additive and component parameters are both parameterized by a linear function, while the component parameters are initialized randomly in the prior inference, as shown in the middle of Fig. 2. Here we omit the prior inference due to its similarity to the posterior inference. The corresponding lexical-syntactic latent variable ~z/~z' is sampled from the posterior/prior inference by reparameterizing z/z'. Finally, during training/generation (detailed in Sec. 3.3), the visual feature v and the latent variable ~z/~z' are fed into the LSTM to generate sequential caption outputs.
In the following, we briefly introduce the lexicon-syntax based VP-tree in Sec. 3.1. We then give the details of the proposed VarMI-tree in Sec. 3.2. Finally, in Sec. 3.3, we introduce the proposed structured encoder-inferer-decoder schema. For clarity, the main notations and their definitions throughout the paper are shown in Tab. 1.

3In some complex tasks, e.g., image captioning, q(z|x) is commonly termed the inferer to differentiate it from the visual encoder CNN.
4Textual parsing and pruning preprocesses are conducted by following [13] to obtain the tree structure.

3.1 Visual Parsing Tree
The visual parsing tree (VP-tree) was firstly proposed in [13], and it serves as a robust parser to discover visual entities and their relations from a given image. 
To parse them in the lexicon and the syntax, we modify the VP-tree as in Fig. 3 (a), where the probability distributions of K^(ℓ) words and K^(s) POSs, i.e., c^(ℓ) ∈ R^{K^(ℓ)} and c^(s) ∈ R^{K^(s)}, are estimated in each node for weighting in the subsequent VarMI-tree.

Figure 3: The exemplar subtrees of VP-tree and VarMI-tree. The differences lie in: 1) single vs. multi-modal Semantic mapping; 2) whether the lexical and syntactic latent variables z^(ℓ) and z^(s) are inferred or not; 3) the probability distributions c^(ℓ) and c^(s) of the optimized VP-tree are utilized for weighting in the inference of the VarMI-tree.

There are M (typically, M = 7) tree nodes in these two binary trees. To distinguish the two trees, we mark the variables and parameters of the VP-tree with an underline ("_"). As shown in Fig. 3 (a), the VP-tree consists of three operations, Semantic mapping, Node combining, and Classifying, where the first two adopt normal linear mapping and concatenating operations upon the visual feature v to obtain the node feature _h of each node. In Classifying, the feature _h of each node is mapped into the word's and POS's category spaces, respectively, according to their vocabularies. For the j-th node, we obtain its word and POS probability distributions, i.e., _c^(ℓ)_j and _c^(s)_j, as follows:

_c^(ℓ)_j = f^(cl)(_W^(ℓ)_r _h_j + _b^(ℓ)_r),   _c^(s)_j = f^(cl)(_W^(s)_r _h_j + _b^(s)_r),
s.t. (j : r) ∈ {(1 : E), (3 : E), (5 : E), (7 : E), (2 : R), (6 : R), (4 : R)},   (4)

where f^(cl) is a Softmax function with parameters _W^(cl)_r and _b^(cl)_r for the entity (r = "E") or the relation (r = "R") classifications. We unify _c^r_j (r = "E", "R") into _c_j for simplification. During
During\ntraining, cj is used to compute the cross entropy loss with the lexical/syntactic category labels4. The\nparameter set is \ufb01nally optimized for automatical tree construction given an image feature, where\neach node provides the optimal word\u2019s and POS\u2019s probability distributions c(\u2113) and c(s). Note that\none can replace VP-Tree with other alternative visual structured representations for the lexicon and\nsyntax. However, in order to directly demonstrate the effectiveness of the core idea, we intentionally\nchose the straightforward assistance of VP-Tree.\n\nand b\n\u00af\n\n(cl)\nr\n\n3.2 Variational Multi-modal Inferring Tree\nThe major challenge of VSSI-cap is to model the posterior inference of both the lexical and syntactic\nlatent variables, i.e., q\u03d5(\u2113),\u03c8(z(\u2113)|S, v, c(\u2113)) and q\u03d5(s),\u03c8(z(s)|S, v, c(s)), in the tree structure. To this\nend, we design a variational multi-modal inferring tree (VarMI-tree) to further innovate the VP-tree\nas illustrated in Fig. 3 (b) and Fig. 2 (Middle). VarMI-tree consists of three operations, i.e., Semantic\nmapping, Node combining, and Inferring. We itemize them as follows:\n\nSemantic Mapping.\nIn the encoder, the visual feature v is extracted from the last fully-connected\nlayer of CNN [27] while the j-th word\u2019s feature ej (j \u2208 {1, . . . , M} corresponds to the j-th tree\nnode) of the caption S is extracted by textual parsing and word embedding as aforementioned. In the\ninferer, these features are mapped into different semantic spaces, i.e., subjects, objects, and relations\nin VarMI-tree as shown in Fig. 3 and Fig. 2, which can be formulated as:\n\n(\n\n)\n\n,\n\ns.t.(j : r) \u2208 {(1 : Subj1), (3 : Obj1), (5 : Subj2), (7 : Obj2)} ,\n\nr\n\nr\n\nhj = f (sm)\n\nW(sm)\n\n[v; ej] + b(sm)\n\n(5)\n\nwhere r represents one of four semantic entity items, i.e., subject 1, object 1, subject 2, and object 2\nas set up in VP-tree. 
[·;·] is the concatenation operation. f^(sm) denotes a non-linear function with the parameters W^(sm)_r and b^(sm)_r for Semantic mapping in the j-th node (j = 1, 3, 5, 7). For the non-leaf nodes (j = 2, 4, 6), a similar operation is conducted as above, where, however, v is replaced with the combination features (computed in the next part), as shown in Fig. 3.

Node Combining. The Node combining operation of the VarMI-tree is the same as that of the VP-tree. Correspondingly, we denote the parameters with W^(nc) and b^(nc).

Inferring. For clarity, we define the function H_j as a unified operation of the above Semantic mapping and Node combining for the j-th node feature, i.e., h_j = H_j(v, e; ψ). In the j-th node, the lexical and syntactic posterior inferences can be approximated upon an additive Gaussian distribution. 
For clarity, we only formulate it for the lexicon below:

q_{φ^(ℓ),ψ}(z^(ℓ)_j | S, v, c^(ℓ)_j) = N( z^(ℓ)_j | sum_{k=1}^{K^(ℓ)} c^(ℓ)_{jk} µ^(ℓ)_{jk}(H_j), Σ^(ℓ)_j^2 I ),   (6)

where Σ^(ℓ)_j^2 I is the spherical covariance matrix with Σ^(ℓ)_j^2 = sum_{k=1}^{K^(ℓ)} c^(ℓ)_{jk} σ^(ℓ)_{jk}(H_j)^2, and K^(ℓ) denotes the length of the word vocabulary. The component Gaussian parameters can be obtained as:

µ^(ℓ)_{jk}(H_j) = W^(ℓ)_{µ,jk} H_j + b^(ℓ)_{µ,jk},   log σ^(ℓ)_{jk}(H_j)^2 = W^(ℓ)_{σ,jk} H_j + b^(ℓ)_{σ,jk}.   (7)

To enable differentiability in the end-to-end manner, we reparameterize z^(ℓ)_j into ~z^(ℓ)_j via the reparameterization trick [23] as:

~z^(ℓ)_j = µ^(ℓ)_j + σ^(ℓ)_j ⊙ ε^(ℓ),   (8)

where ε^(ℓ) obeys a standard Gaussian distribution to introduce noise for the lexical diversity, and ⊙ is the element-wise product. Similar to the posterior, the prior p(z^(ℓ)_j | c^(ℓ)_j) can be formulated as:

p(z^(ℓ)_j | c^(ℓ)_j) = N( z^(ℓ)_j | sum_{k=1}^{K^(ℓ)} c^(ℓ)_{jk} µ'^(ℓ)_{jk}, ( sum_{k=1}^{K^(ℓ)} c^(ℓ)_{jk} σ'^(ℓ)_{jk}^2 ) I ),   (9)

where µ'^(ℓ)_k and σ'^(ℓ)_k are randomly initialized. z'^(ℓ)_j is reparameterized into ~z'^(ℓ)_j as in Eq. 8.

3.3 Structured Encoder-inferer-decoder
The structured encoder-inferer-decoder schema aims at integrating the lexical/syntactic latent variables in a tree structure to diversify the generated captions. Following Eq. 
3, we give the final objective function as follows:

L_VSSI-Cap(θ, φ^(ℓ), φ^(s), ψ; S, v, c^(ℓ), c^(s)) = E_d(θ; S, v, c^(ℓ), c^(s))
  − sum_{j=1}^{M} [ D_KL( q_{φ^(ℓ),ψ}(z^(ℓ)_j | S, v, c^(ℓ)_j) || p(z^(ℓ)_j | c^(ℓ)_j) ) + D_KL( q_{φ^(s),ψ}(z^(s)_j | S, v, c^(s)_j) || p(z^(s)_j | c^(s)_j) ) ],   (10)

where most of the above notations are defined in Eq. 3. D_KL can be approximated following [28] (see the algorithm flow in the supplementary material). E_d is the approximate expectation on the log-likelihood of p_θ(S|I) in the decoder. For the reconstruction loss, we use the Monte Carlo method to approximate the expectation E_d in Eq. 10 after sampling ~z^(ℓ)_j and ~z^(s)_j, which is formulated as:

E_d = (1/N) sum_{i=1}^{N} sum_{t=0}^{T} log p( S_t | S_{0:t-1}, {z^(ℓ)_j(i)}_{j=1}^{M}, {z^(s)_j(i)}_{j=1}^{M}, v, c^(ℓ), c^(s) ),
s.t. ∀ i, j:  z^(ℓ)_j(i) ~ q_{φ^(ℓ),ψ}(z^(ℓ)_j | S, v, c^(ℓ)_j),  z^(s)_j(i) ~ q_{φ^(s),ψ}(z^(s)_j | S, v, c^(s)_j),   (11)

where N and T denote the number of samples of z(i) (sampled by Eq. 8) and the length of the caption, respectively. Since the objective function in Eq. 10 is differentiable, we optimize the model parameter sets θ, φ^(ℓ), φ^(s), and ψ jointly using the stochastic gradient ascent method. To generate captions, we use the above optimal parameters and choose the t-th word ~S_t over the dictionary according to ~S_t = argmax_{S_t} p(S_t | S_{0:t-1}, z'^(ℓ), z'^(s), v), where v and z'^(ℓ), z'^(s) are concatenated to feed the decoder.

4 Experiments
Dataset and Metrics. We conduct all the experiments on the MSCOCO dataset5 [30], which is widely used for image captioning [1, 3] and diverse image captioning [5, 8]. There are over 93K images in MSCOCO, which have been split into training, testing, and validation sets6. Each image has at least five manual captions. 
The quality of captioning results lies in both accuracy (a basic evaluation of captioning quality, used together with the subsequent diversity metrics in [8, 5, 6]) and diversity. For accuracy, we use the MSCOCO caption evaluation tool7 by choosing

5http://cocodataset.org/#download
6https://github.com/karpathy/neuraltalk
7https://github.com/tylin/coco-caption

Table 2: Performance comparisons on accuracy of diverse image captioning. All values are in %. The first and the second places are marked with the bold font and an underline, respectively.

Metric          Bleu-1  Bleu-2  Bleu-3  Bleu-4  Meteor  Rouge-L  CIDEr  Spice
ErDr-cap [29]   69.9    51.8    36.6    23.1    25.6    50.3     84.3   16.4
Up-Down [3]     79.8    -       -       36.3    27.7    56.9     120.1  21.4
G-GAN [6]       -       -       -       20.7    22.4    47.5     79.5   18.2
Adv [5]         -       -       30.5    -       23.9    -        -      16.7
CAL [9]         66.5    48.4    33.2    22.6    21.8    47.8     75.3   16.4
GMM-CVAE [8]    70.0    52.0    37.1    23.2    26.0    50.6     85.4   16.3
AG-CVAE [8]     70.2    52.2    37.1    23.4    26.0    50.6     85.7   16.5
VSSI-cap-L      69.9    51.9    37.3    23.5    26.1    50.7     87.3   16.8
VSSI-cap-S      70.4    52.7    37.9    23.8    27.1    51.1     88.8   17.0
VSSI-cap        70.4    52.7    38.1    23.9    27.3    51.3     89.4   17.1

Table 3: Performance comparisons on diversity. "+" and "*" denote that lower and higher are better, respectively. "n" denotes the number of generated captions (default 5). All values are in %. 
The first and the second places are marked with the bold font and an underline, respectively.

[Table 3 body: per-method mB.+, div1*, div2*, Uni.*, and Nov.* scores for the rows human; ErDr-cap [29]; Up-Down [3]; G-GAN [6]; Adv [5]; CAL [9]; AG-CVAE [8] (n=5 and n=10); VSSI-cap-L; VSSI-cap-S; VSSI-cap (n=5 and n=10). The full cell alignment is not recoverable from this extraction; anchored values include human: 51.0 mB., 34.0 div1, 48.0 div2, 99.8 Uni.; ErDr-cap: 78.0 mB., 28.0 div1, 38.0 div2; Up-Down: 80.9 mB., 27.1 div1, 35.8 div2.]

the best-performing one from the top-5 outputs, including Bleu, Meteor, Rouge-L, CIDEr [30], and Spice [31]. For diversity, we use the benchmark metrics in [5, 8]: 1) Div1, the ratio of unique unigrams to words in the generated captions; higher div1 means more diverse. 2) Div2, the ratio of unique bigrams to words in the generated captions; higher div2 means more diverse. 3) mBleu (mB.), the mean of the Bleu scores computed between each caption in the generated set and the rest; lower mB. means more diverse. 4) Unique Sentence (Uni.), the average percentage of unique captions among the candidates generated for each image. 5) Novel Sentence (Nov.), the percentage of generated captions that do not appear in the training set. For uniformity, each output caption corresponds to a sample of z.

Preprocessing, Parameter Settings, and Implementation Details. In the proposed VarMI-tree, we set the feature dimension of each node as 512. The dimensions of each mean, each sd, and each latent variable are set as 150. 
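The settings reported in this paragraph can be collected into one config sketch; the dict layout and key names are illustrative, not taken from the released code.

```python
# Hyperparameters as reported in the "Preprocessing, Parameter Settings, and
# Implementation Details" paragraph; the structure of this dict is illustrative.
VSSI_CAP_CONFIG = {
    "node_feature_dim": 512,       # feature dimension of each VarMI-tree node
    "gaussian_param_dim": 150,     # each mean, sd, and latent variable
    "word_embedding_dim": 256,
    "lstm_hidden_dim": 512,
    "vocab_sizes": {"entity": 840, "relation": 248, "pos": 4},
    "optimizer": "SGD",
    "initial_lr": 0.005,           # halved every 5 epochs after the first 5 epochs
    "typical_convergence_epochs": 50,
}
```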
We parse the captions using the Stanford Parser [32] and prune the textual parsing results using the POS-tag and lemmatizer tools in NLTK [33], where the dynamic textual parsing trees are converted to a fixed-structured, three-layer, complete binary tree as designed in [13]. Only the words (including entities and relations) and the POSs (i.e., NOUN, VERB, PREP, and CONJ) with high frequency are kept to form the vocabularies. Nouns are regarded as entities and used as leaf nodes in the textual parsing tree, while the others (verbs, coverbs, prepositions, and conjunctions) are taken as relations for the non-leaf nodes. The sizes of the entity, relation, and POS vocabularies are 840, 248, and 4, respectively.8 We extract the visual features from the VGG-16 network [25]. In the LSTM, we use the same hidden-state dimension as [29], which is set to 512. We set the word vector dimension to 256 during word embedding. We implement our model training based on the public code9 with the standard data split and separate z samples. The KL annealing method [34] is adopted to alleviate KL vanishing (see the supplementary material for the training details). All networks are trained with SGD with a learning rate of 0.005 for the first 5 epochs, which is then halved every 5 epochs. On average, all models converge within 50 epochs. The overall process takes 37 hours on an NVIDIA GeForce GTX 1080 Ti GPU with 11GB memory.

8https://github.com/cfh3c/NeurIPS19_VPtree_Dics
9https://github.com/yiyang92/vae_captioning

Baselines and Competing Methods. We compare the proposed VSSI-cap with four baselines: 1) ErDr-cap: a caption generator trained on the encoder-decoder (beam search) [29] that represents the mainstream of general image captioning. 2) AG-CVAE [8]: a recent generative model considering the variation over detected objects for diverse image captioning.
3) VSSI-cap-L: an alternative version of VSSI-cap that omits the syntax. 4) VSSI-cap-S: an alternative version of VSSI-cap that omits the lexicon. We also compare VSSI-cap with the state-of-the-art methods Adv [5] and AG-CVAE [8] (evaluated on the aforementioned universal split). Besides, we compare VSSI-cap with 1) other recent diverse image captioning methods, including G-GAN [6], GMM-CVAE [8], and CAL [9], 2) the state-of-the-art image captioning method, i.e., Up-Down (beam search) [3], and 3) Human: a sentence randomly sampled from the ground-truth/manually-labeled annotations of each image is used as the output of this method. Note that a comparison against pure image captioning methods (which aim only at accuracy) is of limited value due to the mutual interference between accuracy and diversity (a more diverse caption tends to be more inconsistent with the ground-truth caption) [9, 6, 5]; the pure image captioning methods are therefore taken only as additional references.

Evaluation on Accuracy. Tab. 2 presents the accuracy comparisons of our VSSI-cap to the baselines and the state-of-the-arts. Compared to the others (except the state-of-the-art image captioning method), VSSI-cap achieves the best performance under most metrics. Specifically, VSSI-cap outperforms AG-CVAE under all metrics, e.g., 89.4% vs. 85.7% on CIDEr, which reflects the superiority of the visual semantic representation in the proposed VarMI-tree. Additionally, the proposed structured encoder-inferer-decoder schema also contributes to the improvement in accuracy, as shown by the comparison with ErDr-cap. Particularly, the gaps become larger from Bleu-1 to Bleu-4 (from 1-gram to 4-gram), manifesting the superiority of the structured semantic representation in VSSI-cap.
In summary, although VSSI-cap is designed for diverse captioning, the varied but accurate visual semantics are well captured in the lexical and syntactic parsing results, which promotes the accuracy of VSSI-cap on the task of general image captioning.

Evaluation on Diversity. We compare the proposed VSSI-cap to the baseline and state-of-the-art methods on the diversity metrics in Tab. 3. Although there is a gap in diversity compared to the human captions, VSSI-cap achieves the best performance among the learning methods under most metrics, e.g., the best 62.4% on mBleu (lower is better), which reflects the effectiveness of considering both the lexical and syntactic diversities in diverse image captioning, as well as the superiority of the proposed VarMI-tree-based inferer in modeling these diversities. Specifically, VSSI-cap-L and VSSI-cap-S also achieve competitive performance, which manifests the significant roles of the lexical and syntactic diversities, respectively. We conduct additional comparisons with 10 generated captions (5 is the default), where VSSI-cap (n=10) also outperforms AG-CVAE (n=10). We further retrieve the images of the generated captions among 5,000 randomly selected images by taking the captions as queries. The recalls of the ranking results are shown in Fig. 4, where our VSSI-cap provides more discriminative descriptions, outperforming the others by a large margin across all cases. To qualitatively compare the performance on diversity, we show the results of VSSI-cap and the baselines ErDr-cap and AG-CVAE (also a state-of-the-art) in Fig. 5. Clearly, VSSI-cap generates more diverse captions, which further demonstrates the superiority of the proposed VSSI-cap.

Figure 4: The recalls of image rankings for different methods. Given the generated caption queries, R@k is the ratio of correct images being ranked within the top k results.
The left plot is based on the similarity between the generated caption S and each image I, while the right plot is based on the log-likelihood P(S|I), computed for the different methods.

Model Analysis. It is challenging to analyze the internal mechanism of the VAE-based structured encoder-inferer-decoder due to the different vector spaces among 1) different node features, 2) different Gaussian functional parameters, and 3) different lexical/syntactic variables over different nodes. Fortunately, the parsing results of the VP-tree can be assigned different probability distributions (the inputs of the VarMI-tree) in each node to indirectly verify the effectiveness of the VarMI-tree, as shown in Fig. 6. Highly diverse captions are generated from different visual parsing trees with different lexical/syntactic probability distributions. This demonstrates the effectiveness of the VarMI-tree in modeling the lexical/syntactic diversity and embedding it into caption generation in the proposed structured encoder-inferer-decoder.

Figure 5: Visualization of diverse captions (top 3) generated by ErDr-cap (blue), AG-CVAE (green), and our VSSI-cap (red). More results are presented in the supplementary material.

Figure 6: Internal view of the effectiveness of the VarMI-tree by changing its inputs explicitly, i.e., assigning different lexical/syntactic probability distributions to each node from the VP-tree (refer to Fig. 3 for the node index). The histograms of each example reflect different visual parsing trees with different probability distributions assigned in each node, where the middle is the original parsing result (best-in-top3 is shown in each node) from the VP-tree, while the left/right is the parsing result partly changed from the middle, mainly on the word/POS. Captions are generated according to the different visual parsing trees at the bottom.

5 Conclusion
In this paper, we exploit the key factors of diverse image captioning, i.e., the lexical and syntactic diversities. To model these two diversities in image captioning, we propose a variational structured semantic inferring model (VSSI-cap) with a novel variational multi-modal inferring tree (VarMI-tree) in a structured encoder-inferer-decoder schema. Specifically, conditioned on the visual-textual features from the encoder, the VarMI-tree models the lexicon and the syntax, and infers their latent variables in an approximate posterior inference guided by the visual prior. The reconstruction loss and the KL-divergence are jointly estimated to optimize the VSSI-cap model to generate diverse captions. Experiments on the benchmark dataset demonstrate that the proposed VSSI-cap achieves significant improvements over the state-of-the-arts.

Acknowledgments
This work is supported by the National Key R&D Program (No. 2017YFC0113000 and No. 2016YFB1001503), the Nature Science Foundation of China (No. U1705262, No. 61772443, No. 61572410, and No. 61702136), the Post-Doctoral Innovative Talent Support Program under Grant BX201600094, the China Post-Doctoral Science Foundation under Grant 2017M612134, the Scientific Research Project of the National Language Committee of China (Grant No. YB135-49), and the Nature Science Foundation of Fujian Province, China (No. 2017J01125 and No.
2018J01106).

References
[1] Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention.
In CVPR, pages 4651–4659, 2016.

[2] Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR, pages 375–383, 2017.

[3] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pages 6077–6086, 2018.

[4] Jacob Devlin, Saurabh Gupta, Ross Girshick, Margaret Mitchell, and C Lawrence Zitnick. Exploring nearest neighbor approaches for image captioning. arXiv preprint arXiv:1505.04467, 2015.

[5] Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, and Bernt Schiele. Speaking the same language: Matching machine to human captions by adversarial training. In ICCV, pages 4155–4164, 2017.

[6] Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin. Towards diverse and natural image descriptions via a conditional GAN. In ICCV, pages 2989–2998, 2017.

[7] Unnat Jain, Ziyu Zhang, and Alexander Schwing. Creativity: Generating diverse questions using variational autoencoders. In CVPR, pages 5415–5424, 2017.

[8] Liwei Wang, Alexander Schwing, and Svetlana Lazebnik. Diverse and accurate image description using a variational auto-encoder with an additive Gaussian encoding space. In NeurIPS, pages 5756–5766, 2017.

[9] Dianqi Li, Qiuyuan Huang, Xiaodong He, Lei Zhang, and Ming-Ting Sun. Generating diverse and accurate visual captions by comparative adversarial learning. In NeurIPS Workshop on ViGIL, pages 27:1–6, 2018.

[10] Paul Martin Lester. Syntactic theory of visual communication. Retrieved December, 3:1–14, 2006.

[11] Pamela A Hadley, Megan M McKenna, and Matthew Rispoli. Sentence diversity in early language development: Recommendations for target selection and progress monitoring.
American Journal of Speech-Language Pathology, 27(2):553–565, 2018.

[12] Marjorie Meecham and Janie Rees-Miller. Language in social contexts. Contemporary Linguistics, pages 537–590, 2005.

[13] Fuhai Chen, Rongrong Ji, Jinsong Su, Yongjian Wu, and Yunsheng Wu. StructCap: Structured semantic embedding for image captioning. In ACM MM, pages 46–54, 2017.

[14] Xian Wu, Guanbin Li, Qingxing Cao, Qingge Ji, and Liang Lin. Interpretable video captioning via trajectory structured localization. In CVPR, pages 6829–6837, 2018.

[15] Fuhai Chen, Rongrong Ji, Xiaoshuai Sun, Yongjian Wu, and Jinsong Su. GroupCap: Group-based image captioning with structured relevance and diversity constraints. In CVPR, pages 1345–1353, 2018.

[16] Bo Dai, Sanja Fidler, and Dahua Lin. A neural compositional paradigm for image captioning. In NeurIPS, pages 656–666, 2018.

[17] Cesc Chunseong Park, Byeongchang Kim, and Gunhee Kim. Attend to you: Personalized image captioning with context sequence memory networks. In CVPR, pages 895–903, 2017.

[18] Zhuhao Wang, Fei Wu, Weiming Lu, Jun Xiao, Xi Li, Zitong Zhang, and Yueting Zhuang. Diverse image captioning via GroupTalk. In IJCAI, pages 2957–2964, 2016.

[19] Chuang Gan, Zhe Gan, Xiaodong He, Jianfeng Gao, and Li Deng. StyleNet: Generating attractive visual captions with styles. In CVPR, pages 3137–3146, 2017.

[20] Alexander Mathews, Lexing Xie, and Xuming He. SemStyle: Learning to generate stylised image captions using unaligned text. In CVPR, pages 8591–8600, 2018.

[21] Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424, 2016.

[22] Ramakrishna Vedantam, Samy Bengio, Kevin Murphy, Devi Parikh, and Gal Chechik.
Context-aware captions from context-agnostic supervision. In CVPR, pages 251–260, 2017.

[23] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[24] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278–1286, 2014.

[25] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In NeurIPS, pages 3483–3491, 2015.

[26] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. Attribute2Image: Conditional image generation from visual attributes. In ECCV, pages 776–791, 2016.

[27] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[28] John R Hershey and Peder A Olsen. Approximating the Kullback-Leibler divergence between Gaussian mixture models. In ICASSP, volume 4, pages IV-317, 2007.

[29] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, pages 3156–3164, 2015.

[30] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.

[31] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic propositional image caption evaluation. In ECCV, pages 382–398, 2016.

[32] Richard Socher, Cliff C Lin, Chris Manning, and Andrew Y Ng. Parsing natural scenes and natural language with recursive neural networks. In ICML, pages 129–136, 2011.

[33] Edward Loper and Steven Bird. NLTK: the natural language toolkit.
arXiv preprint cs/0205028, 2002.

[34] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. In CoNLL, pages 10–21, 2016.