{"title": "Drill-down: Interactive Retrieval of Complex Scenes using Natural Language Queries", "book": "Advances in Neural Information Processing Systems", "page_first": 2651, "page_last": 2661, "abstract": "This paper explores the task of interactive image retrieval using natural language queries, where a user progressively provides input queries to refine a set of retrieval results. Moreover, our work explores this problem in the context of complex image scenes containing multiple objects. We propose Drill-down, an effective framework for encoding multiple queries with an efficient compact state representation that significantly extends current methods for single-round image retrieval.\nWe show that using multiple rounds of natural language queries as input can be surprisingly effective to find arbitrarily specific images of complex scenes. Furthermore, we find that existing image datasets with textual captions can provide a surprisingly effective form of weak supervision for this task. We compare our method with existing sequential encoding and embedding networks, demonstrating superior performance on two proposed benchmarks: automatic image retrieval on a simulated scenario that uses region captions as queries, and interactive image retrieval using real queries from human evaluators.", "full_text": "Drill-down: Interactive Retrieval of Complex Scenes using Natural Language Queries\n\nFuwen Tan\nUniversity of Virginia\nfuwen.tan@virginia.edu\n\nPaola Cascante-Bonilla\nUniversity of Virginia\npc9za@virginia.edu\n\nXiaoxiao Guo\nIBM Research AI\nxiaoxiao.guo@ibm.com\n\nHui Wu\nIBM Research AI\nwuhu@us.ibm.com\n\nSong Feng\nIBM Research AI\nsfeng@us.ibm.com\n\nVicente Ordonez\nUniversity of Virginia\nvicente@virginia.edu\n\nAbstract\n\nThis paper explores the task of interactive image retrieval using natural language queries, where a user progressively provides input queries to refine a set of retrieval results. 
Moreover, our work explores this problem in the context of complex image scenes containing multiple objects. We propose Drill-down, an effective framework for encoding multiple queries with an efficient compact state representation that significantly extends current methods for single-round image retrieval. We show that using multiple rounds of natural language queries as input can be surprisingly effective to find arbitrarily specific images of complex scenes. Furthermore, we find that existing image datasets with textual captions can provide a surprisingly effective form of weak supervision for this task. We compare our method with existing sequential encoding and embedding networks, demonstrating superior performance on two proposed benchmarks: automatic image retrieval on a simulated scenario that uses region captions as queries, and interactive image retrieval using real queries from human evaluators.\n\n1 Introduction\n\nRetrieving images from text-based queries has been an active area of research that requires some level of visual and textual understanding. Significant improvement has been achieved over the past years with advances in representation learning, but finding very specific images with detailed specifications remains challenging. A common way of specification is through natural language queries, where a user inputs a description of the image and obtains a set of results. We focus on a common scenario where a user is trying to find an exact image, or similarly where the user has a very specific idea of a target image, or is deciding on-the-fly while querying. 
We present empirical evidence that users are much more successful if they are allowed to refine their search results with subsequent textual queries. Users might start with a general query about the \u201cconcept\u201d of the image they have in mind and then \u201cdrill down\u201d onto more specific descriptions of objects or attributes in the image to refine the results.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: An example of the interactive image retrieval with our Drill-down model, where a user generated query ($U_t$) progressively refines the search results ($S_t$) until the target image is among top search results.\n\nAmong previous efforts in image retrieval, a promising paradigm is to learn a visual-semantic embedding by minimizing the distance between a target image and an input textual query using a joint feature space. Pioneering approaches such as [17, 34, 9, 21, 36, 33] have demonstrated remarkable performance on large scale datasets such as Flickr30K [26] and COCO [23], and domain-specific tasks such as outfit composition [12]. However, we find that these methods are limited in their capacity for retrieving highly specific images, because it is either difficult for users to be specific enough with a single query or users may not have the full picture in mind beforehand. We show an example of this type of interaction in Figure 1. While single-query retrieval might be more suited for domains such as product search where images typically contain only one object, requiring users to describe a whole scene in one sentence might be too demanding. 
More recently, dialog based search has been proposed to overcome some of the limitations of single-query retrieval [22, 31, 10, 7].\n\nIn this paper, we propose Drill-down, an interactive image search framework for retrieving complex scenes, which learns to capture the fine-grained alignments between images and multiple text queries. Our work is inspired by the observations that: (1) user queries at each turn may not exhaustively describe all the details of the target image, but focus on some local regions, which provide a natural decomposition of the whole scene. Therefore, we explicitly represent images as a list of object/stuff level features extracted from a pre-trained object detector [27]. This is also in line with recent research [21, 36] on learning region-phrase alignments for single-query methods; (2) complex scenes contain multiple objects that might share the same feature subspace. Particularly, existing state representations of sequential text queries, such as the hidden states of an RNN, condense all image properties in a single state vector, which makes it difficult to distinguish entities sharing the same feature subspace, such as multiple person instances. To address this, we propose to maintain a set of state vectors, encouraging each of the vectors to encode text queries corresponding to a distinct image region. Figure 2 shows an overview of our approach: images are represented with local feature representations, and the query state is represented by a fixed set of vectors that are selectively updated with each subsequent query.\n\nWe demonstrate the effectiveness of our approach on the Visual Genome dataset [20] in two scenarios: automatic image retrieval using region captions as queries, and interactive image retrieval with real queries from human evaluators. 
In both cases, our experimental results show that the proposed model outperforms existing methods, such as a hierarchical recurrent encoder model [29], while using less computational budget.\n\nOur main contributions can be summarized as follows (code is available at https://github.com/uvavision/DrillDown):\n\n\u2022 We propose Drill-down, an interactive image search approach with multiple round queries which leverages region captions as a form of weak supervision during training;\n\u2022 We conduct experiments on a large-scale natural image dataset: Visual Genome [20], and demonstrate superior performance of our model on both simulated and real user queries;\n\u2022 We show that our model, while producing a compact representation, outperforms competing baseline methods by a significant margin.\n\n2 Related Work\n\nText-based image retrieval has been an active research topic for decades [5, 4, 28]. Prominent, more contemporary works have recognized the need for richer user interactions in order to obtain higher quality results [30, 18, 19, 2]. Siddiquie et al. [30] proposed an approach to use multiple query attributes. Kovashka et al. [18, 19] further proposed using user feedback based on individual visual attributes to progressively improve search results. Arandjelovic et al. [2] proposed a multiple query retrieval system that was used for querying specific objects within a large set of images. These works show that multiple independent queries generally outperform methods that jointly model the input set with a single query. Our work builds on these previous ideas but does not use an explicit notion of attributes and aims to support more general input text queries.\n\nRemarkable results have been achieved by recent methods based on deep learning [17, 34, 9]. 
These methods typically explore mapping a text query and the target image into a common feature space. Learned feature representations are designed to capture both visual and semantic information in the same embedding space. In contrast, besides supporting multiple rounds of queries, our approach also has a richer region representation to explicitly map individual entities in images to textual phrases.\n\nAnother line of recent inquiry is dialog based image search systems [22, 10]. Liao et al. [22] proposed to aggregate multi-round user responses from trained agents or human agents in order to iteratively refine a retrieved set of images using a hierarchical recurrent encoder-decoder framework [29]. We follow a similar protocol, but we explore a more open-ended domain of images corresponding to scenes depicting multiple objects. The method of Guo et al. [10], as in our work, used multiple rounds of natural language queries, and proposed collecting relative image captions as supervision for a product search task. In contrast, we pursue a weakly supervised approach where we leverage an image dataset with region captions that are used to simulate queries during training, thus bypassing the need to collect extra annotations. We demonstrate that training with simulated queries is surprisingly effective under human evaluations. As the hierarchical recurrent framework [29] was used in most of the previous dialog based methods [6, 7, 31, 22, 10], we provide a re-implementation of the hierarchical encoder (HRE) model with the queries as context and use it as one of our baselines. Different from the previous dialog based methods where the systems also provide textual responses, we explore a scenario where the system only responds with retrieved images, so no decoder module is required in our case.\n\nAlso relevant to our research are the existing works on learning image-word [14, 11, 21] or region-phrase [25] alignments for vision-language tasks. 
For instance, Karpathy et al. [14] proposed to learn a bidirectional image-sentence mapping by jointly embedding fragments of images (objects) and sentences. The image fragments are extracted using a pre-trained object detector, while the sentence fragments are obtained using a dependency tree relation parser. Niu et al. [25] extended this work by jointly learning hierarchical relations between phrases and image regions in an iterative refinement framework. Recently, Lee et al. [21] developed a stacked cross attention network for word-region matching. Compared to these models, our proposed query state encoding aims at integrating multiple round queries while still using a compact representation of fixed size (i.e. independent of the number of queries), so that retrieval times do not depend on the number or the length of the queries. We show our compact representation to be both efficient and effective for interactive image search.\n\nMore closely related to our work are Memory Networks [35, 32, 15], which perform query and possibly update operations on a predefined memory space. In contrast to this line of research, we explore a more challenging scenario where the model needs to create and update the memory (i.e. the state vectors) on-the-fly so as to maintain the states of the queries.\n\n3 Model\n\nRetrieving images with multi-round refinements offers the potential benefit of reducing the ambiguity of each query but also raises challenges on how to integrate user queries from multiple rounds. Our model is inspired by the observation that users naturally underspecify in their queries by referring to local regions of the target image. We aim to capture these region level alignments by learning to map text queries $\\{s_t\\}_{t=1}^{T}$ and the target image $I$ into two sets of latent vectors $\\{x_i\\}_{i=1}^{M}$ and $\\{v_j\\}_{j=1}^{N}$ respectively, and computing the matching score of $\\{s_t\\}_{t=1}^{T}$ and $I$ by measuring and aggregating fine-grained similarities between $\\{x_i\\}_{i=1}^{M}$ and $\\{v_j\\}_{j=1}^{N}$. 
Figure 2 provides an overview of our model.\n\nFigure 2: Overview of our model. Drill-down maintains a fixed set of state vectors $X$, modeling the historical context of the user queries. Given a new query $q_t$, our model selects and updates one of the state vectors. The updated state vectors $X_t$ and image region features are then projected to a cross-modal embedding space to measure the fine-grained alignment between each region-state pair.\n\n3.1 Image representation\n\nTo identify candidate regions referred to in the queries, we follow [1, 21]. For each image $I$, we first detect the potential objects and salient stuff using the FasterRCNN detector [27]. Corresponding features $\\{c_j\\}$ are extracted from the ROI pooling layer of the detector. In practice, we leverage the object detector provided by [1], which is pre-trained on Visual Genome [20] with 1600 predefined object and stuff classes. A linear projection $v_j = W_I c_j + b_I$ is applied to reduce $\\{c_j\\}$ into D-dimensional latent vectors $V = \\{v_j\\}_{j=1}^{N}$, $v_j \\in \\mathbb{R}^{D}$. Here $N$ is the number of regions in each image. The learnable parameters for the image representation $\\{W_I, b_I\\}$ are denoted as $\\theta_I$.\n\n3.2 Query representation\n\nSupporting multi-round retrieval requires a state representation for integrating the queries from multiple turns. Solutions adopted by existing methods include applying a single recurrent network to the concatenation of all queries [9] or a hierarchical recurrent network [7, 31, 22, 10] modeling individual query and historical context in separate recurrent modules. These approaches produce a single latent vector which aggregates all queries. 
While state-of-the-art models [22, 10] show remarkable performance on domains such as fashion product search, we demonstrate that currently used single-vector representations are not the most effective for capturing complex scenes with multiple objects. Specifically, as image features used in existing methods are typically extracted from the penultimate layer of a pre-trained image classification or object detection model, input instances of the same or very similar categories activate the same feature units in the extracted feature space. Therefore, it is nontrivial for these latent representations to encode and distinguish multiple entities from the same or very similar categories (e.g. multiple person instances).\n\nWe propose to maintain a set of latent representations $X = \\{x_i\\}_{i=1}^{M}$, $x_i \\in \\mathbb{R}^{D}$ for multiple turn queries. Here $M$ is the number of latent vectors. This parameter represents the computational budget, since retrieval time will depend on the compactness of this representation. While users might provide a general image description in the first round of querying, subsequent queries typically describe more specific regions. We aim at finding a good alignment between queries and image region representations $\\{v_j\\}_{j=1}^{N}$. An ideal set of $\\{x_i\\}_{i=1}^{M}$ should learn to group and encode the input queries into visually discriminative representations referring to distinct image regions. In the remainder of this section, we first introduce the cross modal similarity formula used in our model. We then explain how to update the state representations $\\{x_i\\}_{i=1}^{M}$ from the queries $\\{s_t\\}_{t=1}^{T}$ so as to optimize their matching score with the target image.\n\n
3.3 Cross modal similarity\n\nTo measure the similarity of $X = \\{x_i\\}_{i=1}^{M}$ and $V = \\{v_j\\}_{j=1}^{N}$, we first compute the cosine similarity of each possible state-region pair $(x_i, v_j)$: $s(x_i, v_j) = x_i^{T} v_j / (\\|x_i\\| \\|v_j\\|)$, where $\\|\\cdot\\|$ denotes the L2 norm. Given $s(x_i, v_j)$, we define the similarity $s(x_i, I)$ between a state vector $x_i$ and the target image $I$ as\n\n$$s(x_i, I) = \\frac{1}{N} \\sum_{k=1}^{N} \\alpha_{ik}\\, s(x_i, v_k), \\qquad \\alpha_{ik} = \\frac{\\exp(s(x_i, v_k)/\\sigma)}{\\sum_{j=1}^{N} \\exp(s(x_i, v_j)/\\sigma)} \\qquad (1)$$\n\nHere $\\sigma$ is a temperature hyper-parameter. Note that this formulation is similar to measuring the cosine similarity of $x_i$ and a context vector $\\sum_{k=1}^{N} \\alpha_{ik} v_k$ from an attention module [24, 21]. The cross modal similarity between the state vectors $X = \\{x_i\\}_{i=1}^{M}$ and the target image $I$ is defined as $s(X, I) = \\frac{1}{M} \\sum_{k=1}^{M} s(x_k, I)$.\n\n3.4 Query encoding\n\nGiven a query input $s_t$ at time $t$, our model maps each word token $w_k$ in $s_t$ to an E-dimensional vector via a linear projection: $e_k = W_E w_k$, $e_k \\in \\mathbb{R}^{E}$, $k = 1, \\dots, K$, then generates the sentence embedding via a uni-directional recurrent network $\\phi$ with gated recurrent units (GRU) as: $h_k = \\phi(e_k, h_{k-1})$, $h_k \\in \\mathbb{R}^{D}$. 
The first hidden state of $\\phi$ is initialized as a zero vector, while the last hidden state is treated as the sentence representation: $q_t = h_K$. We also explore using a bidirectional encoder but find no improvement. Given the assumption that each text query describes a sub-region of the image, each $q_t$ only updates a subset of the state vectors. In this work, we focus on a simplified scenario where each $q_t$ only updates a single state vector $x_k^{t-1} \\in X_{t-1}$. In detail, given the text query $q_t$ at time step $t$, our model samples $x_k^{t-1}$ from the previous state vector set $X_{t-1} = \\{x_i^{t-1}\\}_{i=1}^{M}$ based on the probability:\n\n$$\\pi(x_k^{t-1} \\mid X_{t-1}, q_t) = \\begin{cases} \\dfrac{\\mathbf{1}(x_k^{t-1} = \\emptyset)}{\\sum_j \\mathbf{1}(x_j^{t-1} = \\emptyset)} & \\text{if } X_{t-1} \\text{ has an empty vector} \\\\ \\dfrac{\\exp(f(x_k^{t-1}, q_t))}{\\sum_j \\exp(f(x_j^{t-1}, q_t))} & \\text{otherwise} \\end{cases} \\qquad (2)$$\n\n$$f(x_k^{t-1}, q_t) = W_\\pi^3(\\delta(W_\\pi^2(\\delta(W_\\pi^1 [x_k^{t-1}; q_t] + b_\\pi^1)) + b_\\pi^2)) + b_\\pi^3, \\qquad (3)$$\n\nwhere $\\mathbf{1}(x_j^{t-1} = \\emptyset)$ is an indicator function which returns 1 if $x_j^{t-1}$ is an empty vector and 0 otherwise. $f(\\cdot)$ is a multilayer perceptron mapping the concatenation of $x_k^{t-1}$ and $q_t$ into a scalar value. Here $\\delta$ is the ReLU activation function, and $W_\\pi^1 \\in \\mathbb{R}^{D \\times 2D}$, $W_\\pi^2 \\in \\mathbb{R}^{D \\times D}$, $W_\\pi^3 \\in \\mathbb{R}^{1 \\times D}$, $b_\\pi^1, b_\\pi^2 \\in \\mathbb{R}^{D}$, $b_\\pi^3 \\in \\mathbb{R}$ are model parameters. An empty state vector is initialized with zero values. Ideally, an expressive sample policy should learn to allocate a new state vector when necessary. However, we empirically find it beneficial to assign $q_t$ to an empty state vector whenever possible. 
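As a rough illustration of the selection rule in Equation 2, the following pure-Python sketch picks uniformly among empty (all-zero) state vectors when any exist and otherwise uses a softmax over scores. The names selection_probs, score_fn and dot are hypothetical, and score_fn merely stands in for the learned MLP f; this is a sketch under those assumptions, not the authors' implementation.

```python
import math


def selection_probs(states, query, score_fn):
    # states: list of M state vectors (lists of floats); an all-zero vector
    # is treated as empty, following the zero initialisation in the text.
    # score_fn: hypothetical stand-in for the learned MLP f(x, q) -> scalar.
    empty = [all(v == 0.0 for v in x) for x in states]
    if any(empty):
        # Uniform over the empty slots, zero elsewhere (first case of Eq. 2).
        n_empty = sum(empty)
        return [1.0 / n_empty if e else 0.0 for e in empty]
    # Softmax over f(x_k, q_t) (second case of Eq. 2), shifted for stability.
    scores = [score_fn(x, query) for x in states]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(x, q):
    # Toy scoring function used only for this example.
    return sum(a * b for a, b in zip(x, q))


probs_with_empty = selection_probs([[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]], [1.0, 1.0], dot)
probs_full = selection_probs([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0], dot)
```

In the first call two slots are still empty, so the probability mass is split evenly between them; in the second call every slot is occupied and the slot with the higher score is preferred.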
Once $x_k^{t-1}$ is sampled, we update this state vector using a single uni-directional gated recurrent unit cell (GRU Cell) $\\tau$: $x_k^{t} = \\tau(q_t, x_k^{t-1})$. Note that our formulation is similar to a hard attention module [37]. Leveraging a soft attention is possible, but it is more computationally expensive as it would need to update all state vectors. Our state vector update mechanism is inspired by the knowledge base methods with external memory [22]. Our method can be interpreted as building a knowledge base memory online from scratch, only from the query context, which can be trained end-to-end with other modules. We denote the learnable parameters for the state vector update policy function $\\pi(\\cdot)$ as $\\theta_\\pi = \\{W_\\pi^1, W_\\pi^2, W_\\pi^3, b_\\pi^1, b_\\pi^2, b_\\pi^3\\}$, and for the remaining modules as $\\theta_q = \\{W_E, \\phi, \\tau\\}$.\n\n3.5 End-to-end training\n\nOur model is trained to optimize $\\theta_I$, $\\theta_\\pi$ and $\\theta_q$ so as to achieve a high similarity score between the queries $\\{s_t\\}_{t=1}^{T}$ and the target image $I$. Thus, we follow [9, 21] and adopt a triplet loss on $s(X, I)$ with hard negatives:\n\n$$L_e = \\arg\\min_{\\theta_I, \\theta_q} \\sum_{X, I} \\ell(X, I)$$\n\n$$\\ell(X, I) = \\max_{I'} [\\alpha + s(X, I') - s(X, I)]_+ + \\max_{X'} [\\alpha + s(X', I) - s(X, I)]_+ \\qquad (4)$$\n\nHere, $\\alpha$ is a margin parameter, $[\\cdot]_+ \\equiv \\max(\\cdot, 0)$. $I'$ and $X'$ are decoy images and state vectors within the same mini-batch as the ground-truth pair $(X, I)$ during training. Note that $L_e$ will only optimize the parameters $\\theta_I$ and $\\theta_q$. Directly optimizing $\\theta_\\pi$ is difficult as sampling from Equation 2 is non-differentiable. 
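To make the training objective concrete, here is a small pure-Python sketch of the similarity s(X, I) from Section 3.3 and of the hard-negative triplet term used in the loss above. The function names are hypothetical, the defaults sigma = 9 and alpha = 0.2 are the values reported in the implementation details of Section 4, and the code is an illustration under those assumptions rather than the released implementation.

```python
import math


def cosine(x, v):
    # Cosine similarity s(x_i, v_j) between two vectors.
    num = sum(a * b for a, b in zip(x, v))
    den = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in v))
    return num / den


def state_image_sim(x, regions, sigma=9.0):
    # Eq. 1: softmax-weighted region similarities, scaled by 1/N.
    sims = [cosine(x, v) for v in regions]
    weights = [math.exp(s / sigma) for s in sims]
    total = sum(weights)
    return sum(w / total * s for w, s in zip(weights, sims)) / len(regions)


def set_image_sim(states, regions, sigma=9.0):
    # s(X, I): mean of s(x_i, I) over the M state vectors.
    return sum(state_image_sim(x, regions, sigma) for x in states) / len(states)


def triplet_loss(sim, i, alpha=0.2):
    # sim[a][b] = s(X_a, I_b) for a mini-batch of at least two pairs;
    # pair (i, i) is the ground truth. The loss adds the hardest decoy
    # image for X_i and the hardest decoy state set for I_i.
    n = len(sim)
    pos = sim[i][i]
    hard_image = max(max(alpha + sim[i][b] - pos, 0.0) for b in range(n) if b != i)
    hard_state = max(max(alpha + sim[a][i] - pos, 0.0) for a in range(n) if a != i)
    return hard_image + hard_state


batch_sim = [[0.9, 0.3], [0.8, 0.6]]
loss_0 = triplet_loss(batch_sim, 0)
loss_1 = triplet_loss(batch_sim, 1)
aligned = set_image_sim([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

In the toy batch, the first pair is only violated by the decoy state set (0.8 is within the margin of 0.9), while the second pair is violated by the decoy image, which scores higher than its own ground truth.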
We propose to train the policy parameters via Reinforcement Learning (RL).\n\nFormally, the state in our RL formulation is the set of state vectors $X_t = \\{x_i^{t}\\}_{i=1}^{M}$, and the action $k \\in \\{1, ..., M\\}$ is to select the state vector $x_k^{t}$ from $X_t$ when fusing information from the embedded query vector $q_{t+1}$. The RL objective is to maximize the expected cumulative discounted rewards, so in our case we define the reward function as the similarity between the state vectors $X_t$ and the image $I$, i.e. $s(X_t, I)$. Note that our reward function evaluates the potential similarity at all future time steps instead of only the last step $T$, encouraging the model to find the target image with fewer turns.\n\nSupervised pre-training. As optimizing the sampling policy requires reward signals from the retrieval environment, we pre-train the model by optimizing $L_e$ with a fixed policy: $\\pi(x_k^{t-1} \\mid X_{t-1}, q_t) = \\mathbf{1}(k \\equiv t \\pmod{M})$, where $\\mathbf{1}(\\cdot)$ is an indicator function and $M$ is the number of state vectors. Intuitively, this policy circularly updates the state vectors in order.\n\nJoint optimization. Given the pre-trained environment, we then jointly optimize the sampling policy and the other modules (i.e. $\\theta_I$, $\\theta_q$ and $\\theta_\\pi$). Because the next state $X_{t+1}$ is a deterministic function given the current state $X_t$ and action $k$, we adopt the policy improvement strategy from [10] to update the policy. Specifically, we estimate the state-action value $Q(X_t, k) = \\sum_{t'=t}^{T-1} \\gamma^{t'-t} s(X_{t'+1}, I)$ for each state vector selection action $k$ by sampling one look-ahead trajectory. 
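The look-ahead estimate amounts to a discounted sum of the similarities observed along one rollout per candidate action, with gamma as the discount factor. A minimal sketch, assuming the per-step similarities have already been collected for each candidate slot (the function names are hypothetical and not taken from the paper):

```python
def discounted_return(similarities, gamma=1.0):
    # similarities[n] holds s(X_{t+n+1}, I) along one look-ahead rollout,
    # so the sum below is Q(X_t, k) for the action that started the rollout.
    return sum((gamma ** n) * s for n, s in enumerate(similarities))


def best_action(rollouts, gamma=1.0):
    # rollouts[k] is the similarity sequence obtained after choosing slot k.
    # Returns the index k* with the largest estimated Q value.
    q_values = [discounted_return(r, gamma) for r in rollouts]
    return max(range(len(q_values)), key=q_values.__getitem__)


q_example = discounted_return([0.25, 0.5, 1.0], gamma=0.5)
k_star = best_action([[0.25, 0.25], [0.5, 0.75], [0.5, 0.25]])
```

With gamma = 1.0, the value used in the experiments, the estimate reduces to a plain sum of the future similarities.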
$\\gamma$ is the discount factor. The policy is then optimized to predict the most rewarding action $k^* = \\arg\\max_k Q(X_t, k)$ via a cross entropy loss:\n\n$$L_\\pi = \\arg\\min_{\\theta_\\pi} \\sum_{X_t, q_{t+1}} -\\log(\\pi(x_{k^*}^{t} \\mid X_t, q_{t+1}; \\theta_\\pi)) \\qquad (5)$$\n\nWe also jointly finetune $\\theta_I$ and $\\theta_q$ by applying $L_e$ on the rollout state vectors $X^*$: $L_e^* = \\arg\\min_{\\theta_I, \\theta_q} \\sum_{X^*, I} \\ell(X^*, I)$. The model is trained with the multi-task loss: $L = L_e^* + \\mu L_\\pi$, where $\\mu$ is a scalar factor determining the trade-off between the two terms.\n\n4 Experiments\n\nDataset. We evaluate the performance of our method on the Visual Genome dataset [20]. Each image in Visual Genome is annotated with multiple region captions. We preprocess the data by removing duplicate region captions (i.e. multiple captions that are exactly the same), and images with fewer than 10 region captions. This preprocessing results in 105,414 image samples, which are further split into 92,105/5,000/9,896 for training/validation/testing. We also ensure that the images in the test split are not used for the training of the object detector [1]. All the evaluations, including the human subject study, are performed on the test split, which contains 9,896 images. We use region captions as queries to train our model, thus bypassing the challenging issue of data collection for this task. The vocabulary of the queries is built with the words that appear more than 10 times in all region captions, resulting in a vocabulary size of 14,284. During training, queries and their orders are randomly sampled. During validation and testing, the queries and their orders are kept fixed.\n\nBaselines. We compare our method with four baseline models: (1) HRE: a hierarchical recurrent encoder network, which is commonly adopted by recent dialog based approaches [31, 22, 10]. 
We consider the framework using text queries as context, which consists of a sentence encoder, a context encoder and an image encoder. The sentence encoder has the same word embedding (i.e. the linear projection $W_E$) and sentence embedding (i.e. the $\\phi$ function) as the proposed model. The context encoder is a uni-directional GRU network $\\psi$ that sequentially integrates the sentence features $q_t$ from $\\phi$ and generates the final query feature $\\bar{x}_t$: $\\bar{x}_t = \\psi(q_t, \\bar{x}_{t-1})$. $\\bar{x}_0$ is initialized as a zero vector. The image encoder maps the mean-pooled features of ResNet152 [13] into a one-dimensional feature vector $\\bar{v}$ via a linear projection. The ResNet model is pre-trained on ImageNet [8]. The model is trained to optimize the cosine similarity between $\\bar{x}_t$ and $\\bar{v}$ by a triplet loss with hard negatives as in [9]. (2) R-HRE: a model similar to baseline (1) but trained with the region features $\\{v_j\\}_{j=1}^{N}$, as in the proposed method. Specifically, the model learns to optimize the similarity term $s(\\bar{x}_t, I)$ defined in Eq. (1) by a triplet loss with hard negatives similar to $L_e$ on one state vector. (3) R-RE: a model similar to baseline (2) but instead of using a hierarchical text encoder, this baseline uses a single uni-directional GRU network which encodes the concatenation of the queries. (4) R-RankFusion: a model where each query is encoded by a uni-directional GRU network and each image is represented as a set of region features $\\{v_j\\}_{j=1}^{N}$. The ranks of all images are computed separately for each turn. The final ranks of the images are represented as the averages of the per-turn ranks.\n\nFigure 3: Quantitative evaluation of our models and the baselines. (A) Comparison of models using query representations of the same memory size; (B) Comparison of the models using query representations of different memory sizes. The horizontal axis represents the query turn.\n\nMethods | HRE/R-RE1280 | R-HRE640/1280 | Drill-down3\u00d7128 / 3\u00d7256 / 5\u00d7256 / 10\u00d7256\n# Query Rep. | 1280 | 640 / 1280 | 384 / 768 / 1280 / 2560\n# Image Rep. | 1280 / 36\u00d71280 | 36\u00d7640 / 36\u00d71280 | 36\u00d7128 / 36\u00d7256 / 36\u00d7256 / 36\u00d7256\n# Parameters | 22820k | 9866k / 22820k | 4861k / 5830k / 5830k / 5830k\n\nTable 1: Sizes of the query/image representations and the parameters in our models and the baselines.\n\n
Implementation details. We try to keep consistent configurations for all the models in our experiments to better evaluate the contribution of each component. In particular, all the models are trained with 10-turn queries (T = 10). We use ten turns as we would like to track and demonstrate the performance of all methods in both short-term and long-term scenarios. For each image, we extract the top 36 regions (N = 36) detected by a pretrained Faster RCNN model, following [1]. Each embedded word vector has a dimension of 300 (E = 300). In all our experiments, we set the temperature parameter $\\sigma$ to 9, the margin parameter $\\alpha$ to 0.2, the discount factor $\\gamma$ to 1.0, and the trade-off factor $\\mu$ to 0.1. For optimization, we use Adam [16] with an initial learning rate of 2e-4 and a batch size of 128. We clip the gradients in the back-propagation such that the norm of the gradients is not larger than 10. All models are trained for at most 300 epochs and validated after each epoch. The models which perform best on the validation set are used for evaluation.\n\nEvaluation metrics. To measure the retrieval performance, we use the common R@K metric, i.e., recall at K: the ratio of queries for which the target image is among the top-K retrieved images. 
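For concreteness, R@K can be computed from the 1-based rank of the target image in each retrieval. The helper name and the example ranks below are hypothetical and serve only as an illustration of the metric.

```python
def recall_at_k(target_ranks, k):
    # target_ranks[i] is the 1-based rank of the target image for query i.
    hits = sum(1 for r in target_ranks if r <= k)
    return hits / len(target_ranks)


ranks = [1, 3, 7, 12]
r_at_1 = recall_at_k(ranks, 1)
r_at_5 = recall_at_k(ranks, 5)
r_at_10 = recall_at_k(ranks, 10)
```

In the multi-turn setting the same computation is simply repeated at every turn, using the ranks obtained after that turn.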
The R@1, R@5 and R@10 scores at each turn are reported as shown in Fig. 3.\n\n4.1 Results on simulated user queries\n\nDue to the lack of existing benchmarks for multiple turn image retrieval, we use the annotated region captions in Visual Genome to mimic the user queries. As region captions focus more on invariant information, such as image contents, and convey fewer irrelevant signals, such as different speaking/writing styles, they could be seen as the common \"abstracts\" of real queries in different forms. While we agree that strong supervisory signals such as real user queries could bridge the domain gap and would like to explore further in this direction, we choose at this stage to use only \"weak but free\" signals and investigate their potential for generalizing to real scenarios.\n\nFigure 4: Qualitative examples of Drill-down3\u00d7128. The sequential queries and the corresponding state vectors used to integrate them are shown on the left; the top-3 regions of the target images attended by each state vector are shown on the right, with the same color as the corresponding state vector. Note that all these target images rank top-1 given the input queries.\n\nFirst, we compare our method against the baseline models when using query representations of the same memory size. In particular, we use 5 state vectors in our model (M = 5), each with a dimension of 256. Accordingly, the baseline models use a 1280-d query vector. Figure 3(A) shows the per-turn performance of the models on the test set. Here Drill-down5\u00d7256(FP) indicates the supervised pre-trained model with the fixed policy, and Drill-down5\u00d7256 indicates the jointly optimized model with a learned policy. Both the R-RE1280 and R-HRE1280 baselines perform better than the HRE1280 model, demonstrating the benefit of incorporating region features. R-HRE1280 is superior to R-RE1280, demonstrating the benefit of hierarchical context encoding. 
R-RankFusion1280 performs worse than all other models. Note that it also requires more memory to store the ranks of all images at each turn. Our models outperform all baselines by a large margin. On the other hand, we observe that the performance of our model degrades when different queries have to share the same state vector. For example, after the 5th turn, the Drill-down5\u00d7256(FP) model gains less improvement from each new query. Drill-down5\u00d7256 further improves on Drill-down5\u00d7256(FP) by learning to distribute the queries into the most rewarding state vectors.\n\nTo investigate the design space of the query representation, we further explore variants of our model with different numbers of state vectors and feature dimensions. Table 1 shows the sizes of the query/image representations and the parameters used in our models and the baselines. Note that the R-RankFusion and R-RE models have the same size of query/image representations and parameters. Here Drill-downM\u00d7D indicates the model with M state vectors, each with a dimension of D. As shown in Figure 3(B), while both Drill-down and the R-HRE baseline can be improved by increasing the feature dimension, using more state vectors gains significantly more improvement with the same, or even a smaller, memory budget. For example, Drill-down3\u00d7128 significantly outperforms R-HRE1280 with 3 times fewer query features, 10 times fewer region features and 4 times fewer parameters. The highest performance is achieved by the model which stores each query in a distinct state vector: 10 state vectors for 10-turn queries. Integrating multiple queries into the same state vector could make the model \u201cforget\u201d the responses from earlier turns, especially when they activate the same semantic space as the new query.\n\nFigure 4 provides qualitative examples of the Drill-down3\u00d7128 model. 
Here the arrows indicate the\npredicted state vectors used to incorporate the queries. We show the top-3 regions of the target images\nthat have the highest similarity scores with each state vector (illustrated with the same color). We\nobserve that the model tends to group queries with entities that potentially coincide with each other.\nHowever, this grouping can also lead to the \u201cforgetting\u201d of earlier queries. For instance, in the first example,\nwhen aggregating the queries \u201cchild in a stroller\u201d and \u201cwoman in a dress\u201d in order, the model tends\nto focus on \u201cwoman\u201d while forgetting information about \u201cchild\u201d, as \u201cwoman\u201d and \u201cchild\u201d potentially\nactivate the same semantic subspace.\n\nFigure 5: Examples of real user queries and the top-1 images from Drill-down3\u00d7256.\n\n4.2 Results on real user queries\n\nWe evaluate our method with the queries from crowdsourced human users via a multi-round\ninteractive system adapted from [3]. Given a target image, a user is asked to search for it by\nproviding descriptions of the image content. The system shows the top-5 retrieved images to the user\nat each turn as context so that the user can improve the results by providing additional descriptions.\nThis process is repeated until the image is found or it\nreaches 5 turns. We sample 80 random images from\nthe test set and evaluate HRE1280, R-HRE1280\nand Drill-down3\u00d7256 on these images respectively.\nEach image is viewed by 3 different users. For each\nmodel, the best result on each image is selected across\nusers to ensure high quality responses. As shown in\nFigure 6, most users (> 80%) successfully find the\ntarget image within 5 turns, demonstrating the\neffectiveness of the multi-round search paradigm and\nthe quality of using region captions for training. 
In\nparticular, Drill-down3\u00d7256 consistently outperforms\nHRE1280 and R-HRE1280 on all evaluation metrics.\nOn the other hand, as real user queries have more\nflexible forms, e.g., longer sentences, repeated\ndescriptions of the same region, etc., we also observe\nsmaller performance gaps between our method and\nthe baselines. We believe further efforts such as real\nquery data collection are needed to systematically\nfill this domain gap. Figure 5 shows example real user queries and the retrieval sequences using\nDrill-down3\u00d7256.\n\nFigure 6: Human subject evaluation of\nthe HRE1280, R-HRE1280 baselines and our\nDrill-down3\u00d7256 model.\n\n5 Conclusion\n\nWe present Drill-down, a framework that is efficient and effective in interactive retrieval of specific\nimages of complex scenes. Our method addresses several challenges in multi-round\nretrieval with natural language queries, such as the compactness of query state representations\nand the need for region-aware features. It also demonstrates the effectiveness of training a retrieval\nmodel with region captions as queries for interactive image search under human evaluations.\n\nAcknowledgements We thank our anonymous reviewers for helpful feedback. This work was\nfunded by a research grant from SAP Research and generous gift funding from SAP Research. We\nthank Tassilo Klein and Moin Nabi from SAP Research for their support.\n\nReferences\n[1] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei\nZhang. Bottom-up and top-down attention for image captioning and visual question answering. In IEEE\nConference on Computer Vision and Pattern Recognition (CVPR), 2018.\n\n[2] Relja Arandjelovic and Andrew Zisserman. Multiple queries for large scale specific object retrieval. 
In\nBritish Machine Vision Conference (BMVC), pages 1\u201311, 2012.\n\n[3] Paola Cascante-Bonilla, Xuwang Yin, Vicente Ordonez, and Song Feng. Chat-crowd: A dialog-based\nplatform for visual layout composition. In Conference of the North American Chapter of the Association\nfor Computational Linguistics (NAACL-HLT), 2019.\n\n[4] Ning-San Chang and King-Sun Fu. Query-by-pictorial-example. IEEE Trans. Softw. Eng., 6(6):519\u2013524,\nNovember 1980.\n\n[5] Ning-San Chang and King-Sun Fu. A relational database system for images. In Pictorial Information\nSystems, 1980.\n\n[6] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jos\u00e9 M. F. Moura, Devi Parikh, and\nDhruv Batra. Visual Dialog. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR),\n2017.\n\n[7] Abhishek Das, Satwik Kottur, Jos\u00e9 M. F. Moura, Stefan Lee, and Dhruv Batra. Learning cooperative visual\ndialog agents with deep reinforcement learning. In IEEE International Conference on Computer Vision\n(ICCV), Oct 2017.\n\n[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical\nimage database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.\n\n[9] Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. Vse++: Improving visual-semantic\nembeddings with hard negatives. 2018.\n\n[10] Xiaoxiao Guo, Hui Wu, Yu Cheng, Steven Rennie, Gerald Tesauro, and Rogerio Feris. Dialog-based\ninteractive image retrieval. In Advances in Neural Information Processing Systems (NeurIPS), pages\n676\u2013686, 2018.\n\n[11] Tanmay Gupta, Kevin J. Shih, Saurabh Singh, and Derek Hoiem. Aligned image-word representations\nimprove inductive transfer across vision-language tasks. In IEEE International Conference on Computer\nVision (ICCV), 2017.\n\n[12] Xintong Han, Zuxuan Wu, Yu-Gang Jiang, and Larry S Davis. Learning fashion compatibility with\nbidirectional lstms. 
In ACM Multimedia, 2017.\n\n[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.\nIn IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.\n\n[14] Andrej Karpathy, Armand Joulin, and Li Fei-Fei. Deep fragment embeddings for bidirectional image\nsentence mapping. In Advances in Neural Information Processing Systems (NeurIPS), pages 1889\u20131897,\n2014.\n\n[15] Chlo\u00e9 Kiddon, Luke S. Zettlemoyer, and Yejin Choi. Globally coherent text generation with neural\nchecklist models. In Empirical Methods in Natural Language Processing (EMNLP), 2016.\n\n[16] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International\nConference on Learning Representations (ICLR), 2015.\n\n[17] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying visual-semantic embeddings with\nmultimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.\n\n[18] Adriana Kovashka and Kristen Grauman. Attribute pivots for guiding relevance feedback in image search.\nIn IEEE International Conference on Computer Vision (ICCV), December 2013.\n\n[19] Adriana Kovashka, Devi Parikh, and Kristen Grauman. Whittlesearch: Interactive image search with\nrelative attribute feedback. International Journal of Computer Vision (IJCV), 115(2):185\u2013210, 2015.\n\n[20] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen,\nYannis Kalantidis, Li-Jia Li, David A Shamma, Michael Bernstein, and Li Fei-Fei. Visual genome:\nConnecting language and vision using crowdsourced dense image annotations. 2016.\n\n[21] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked cross attention for\nimage-text matching. In European Conference on Computer Vision (ECCV), 2018.\n\n[22] Lizi Liao, Yunshan Ma, Xiangnan He, Richang Hong, and Tat-Seng Chua. Knowledge-aware multimodal\ndialogue systems. 
In ACM International Conference on Multimedia (ACM MM), pages 801\u2013809, 2018.\n\n[23] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays,\nPietro Perona, Deva Ramanan, Piotr Doll\u00e1r, and C. Lawrence Zitnick. Microsoft COCO: Common objects\nin context. In European Conference on Computer Vision (ECCV), 2014.\n\n[24] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based\nneural machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages\n1412\u20131421, 2015.\n\n[25] Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, and Gang Hua. Hierarchical multimodal lstm for dense\nvisual-semantic embedding. In IEEE International Conference on Computer Vision (ICCV), 2017.\n\n[26] Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana\nLazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence\nmodels. In IEEE International Conference on Computer Vision (ICCV), pages 2641\u20132649, 2015.\n\n[27] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection\nwith region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS), 2015.\n\n[28] Yong Rui, Thomas S. Huang, and Shih-Fu Chang. Image retrieval: Current techniques, promising\ndirections, and open issues. Journal of Visual Communication and Image Representation, 10(1):39\u201362,\n1999.\n\n[29] Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. Building\nend-to-end dialogue systems using generative hierarchical neural network models. In AAAI Conference on\nArtificial Intelligence (AAAI), pages 3776\u20133783, 2016.\n\n[30] Behjat Siddiquie, Rogerio S Feris, and Larry S Davis. Image ranking and retrieval based on multi-attribute\nqueries. 
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 801\u2013808, 2011.\n\n[31] Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun\nNie. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In ACM\nInternational on Conference on Information and Knowledge Management (CIKM), pages 553\u2013562, 2015.\n\n[32] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In\nAdvances in Neural Information Processing Systems (NeurIPS), 2015.\n\n[33] Mariya I. Vasileva, Bryan A. Plummer, Krishna Dusad, Shreya Rajpal, Ranjitha Kumar, and David Forsyth.\nLearning type-aware embeddings for fashion compatibility. In European Conference on Computer Vision\n(ECCV), 2018.\n\n[34] Liwei Wang, Yin Li, and Svetlana Lazebnik. Learning deep structure-preserving image-text embeddings.\nIn IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5005\u20135013, 2016.\n\n[35] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In International Conference on\nLearning Representations (ICLR), 2015.\n\n[36] Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, and Wei-Ying Ma. Unified\nvisual-semantic embeddings: Bridging vision and language with structured meaning representations. In\nIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.\n\n[37] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel,\nand Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. 
In\nInternational Conference on Machine Learning (ICML), volume 37, pages 2048\u20132057, 2015.\n", "award": [], "sourceid": 1524, "authors": [{"given_name": "Fuwen", "family_name": "Tan", "institution": "University of Virginia"}, {"given_name": "Paola", "family_name": "Cascante-Bonilla", "institution": "University of Virginia"}, {"given_name": "Xiaoxiao", "family_name": "Guo", "institution": "IBM Research"}, {"given_name": "Hui", "family_name": "Wu", "institution": "IBM Research"}, {"given_name": "Song", "family_name": "Feng", "institution": "IBM Research"}, {"given_name": "Vicente", "family_name": "Ordonez", "institution": "University of Virginia"}]}