{"title": "Stacked Semantics-Guided Attention Model for Fine-Grained Zero-Shot Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 5995, "page_last": 6004, "abstract": "Zero-Shot Learning (ZSL) is generally achieved via aligning the semantic relationships between the visual features and the corresponding class semantic descriptions. However, using the global features to represent fine-grained images may lead to sub-optimal results since they neglect the discriminative differences of local regions. Besides, different regions contain distinct discriminative information. The important regions should contribute more to the prediction. To this end, we propose a novel stacked semantics-guided attention (S2GA) model to obtain semantic relevant features by using individual class semantic features to progressively guide the visual features to generate an attention map for weighting the importance of different local regions. Feeding both the integrated visual features and the class semantic features into a multi-class classification architecture, the proposed framework can be trained end-to-end. 
Extensive experimental results on CUB and NABird datasets show that the proposed approach has a consistent improvement on both fine-grained zero-shot classification and retrieval tasks.", "full_text": "Stacked Semantics-Guided Attention Model for\n\nFine-Grained Zero-Shot Learning\n\nYunlong Yu, Zhong Ji\u2217\n\nSchool of Electrical and Information Engineering\n\nTianjin University\n\n{yuyunlong,jizhong}@tju.edu.cn\n\nJichang Guo, Yanwei Pang\n\nSchool of Electrical and Information Engineering\n\nTianjin University\n\n{jcguo,pyw}@tju.edu.cn\n\nYanwei Fu\n\nSchool of Data Science\n\nFudan University\n\nAITRICS\n\nyanweifu@fudan.edu.cn\n\nZhongfei (Mark) Zhang\n\nComputer Science Department\n\nBinghamton University\n\nzhongfei@cs.binghamton.edu\n\nAbstract\n\nZero-Shot Learning (ZSL) is generally achieved via aligning the semantic relation-\nships between the visual features and the corresponding class semantic descriptions.\nHowever, using the global features to represent \ufb01ne-grained images may lead\nto sub-optimal results since they neglect the discriminative differences of local\nregions. Besides, different regions contain distinct discriminative information.\nThe important regions should contribute more to the prediction. To this end, we\npropose a novel stacked semantics-guided attention (S2GA) model to obtain seman-\ntic relevant features by using individual class semantic features to progressively\nguide the visual features to generate an attention map for weighting the importance\nof different local regions. Feeding both the integrated visual features and the\nclass semantic features into a multi-class classi\ufb01cation architecture, the proposed\nframework can be trained end-to-end. 
Extensive experimental results on CUB and\nNABird datasets show that the proposed approach has a consistent improvement\non both \ufb01ne-grained zero-shot classi\ufb01cation and retrieval tasks.\n\n1\n\nIntroduction\n\nTraditional object classi\ufb01cation tasks require the test classes to be identical or a subset of the training\nclasses. However, the categories in the reality have a long-tailed distribution, which means that\nno classi\ufb01cation model could cover all the categories in the real world. Targeting on extending\nconventional classi\ufb01cation models to unseen classes, Zero-Shot Learning (ZSL) [7, 8, 18, 29, 30] has\nattracted a lot of interests in the machine learning and computer vision communities.\nThe current approaches formulate ZSL as a visual-semantic alignment problem. In these approaches,\nan image is represented with its global features. Despite good performances on coarse-grained\ndatasets (e.g., Animal with Attribute dataset [10]), the global features have limitations on \ufb01ne-grained\ndatasets since more local discriminative information is required to distinguish classes. As illustrated in\nFig. 1, the global features only capture some holistic information, on the contrary, the region features\ncapture more local information that is relevant to the class semantic descriptions. Consequently, the\nglobal image representations may fail in \ufb01ne-grained ZSL.\n\n\u2217The corresponding author.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fWhen trying to recognize an image from unseen categories, humans tend to focus on the informative\nregions based on the key class semantic descriptions. 
Besides, humans achieve the semantic alignment by ruling out the irrelevant visual regions and locating the most relevant ones in a gradual way.\nMotivated by these observations, and by the ability of attention mechanisms to highlight important local information and neglect irrelevant information in an image, we propose a novel stacked attention-based network that integrates both the global and discriminative local features to represent an image, via progressively allocating different weights to different local visual regions based on their relevance to the class semantic descriptions.\nAs shown in Fig. 2, the proposed approach contains an image featurization part, an attention extraction part, and a visual-semantic matching part. For the image featurization part, we extract the local features that retain the crucial spatial information of an image for the subsequent attention part. It should be noted that the region features can also be compressed into a global one by concatenating or averaging all the local region features. The attention extraction part is the core of the proposed framework, which progressively allocates importance weights to different visual regions based on their relevance to the class semantic features. The visual-semantic matching part is a two-layer neural network that embeds both the class semantic features and the integrated visual features (of both the global and the weighted local features) into a multi-class classification framework.\nIn summary, the contributions of this work are three-fold:\n\nFigure 1: An example of the activation mappings of both the global and region features.\n\n1. We apply the attention mechanism to ZSL to address the issue of irrelevant or noisy information brought by the global features, using the local region features of an image. To the best of our knowledge, this is the first work to apply an attention mechanism to ZSL.\n\n2. 
To effectively obtain the attention map to distribute weights for local region features, we\npropose a stacked attention network guided by the class semantic features for ZSL. It\nintegrates both the global visual features and the weighted local discriminant features to\nrepresent an image by allocating different weights for different local regions based on their\nrelevances to the class semantic features.\n\n3. The whole proposed framework can be trained end-to-end by feeding both the visual\nrepresentations and the class semantic features into a multi-class classi\ufb01cation architecture.\n\nIn the experiments, we evaluate the proposed S2GA framework for \ufb01ne-grained ZSL on two bird\ndatasets: Caltech UCSD Birds-2011 (CUB) [22], and North America Birds (NABird) [21]. The\nexperimental results show that our approach signi\ufb01cantly outperforms the state-of-the-art methods\nwith large margins on both zero-shot classi\ufb01cation and retrieval tasks.\n\n2 Related Work\n\n2.1 Region features-based Zero-Shot approaches\n\nA limitation of most existing ZSL approaches for \ufb01ne-grained datasets is to use global features to\nrepresent the visual images. This may feed irrelevant or noisy information to the prediction stage.\nRecently, [1, 6, 33] adopt local region features to represent images that are more relevant to the class\nsemantic descriptions. Speci\ufb01cally, [6] proposed to learn region-speci\ufb01c classi\ufb01er to connect text\nterms to its relevant regions and suppress connections to non-visual text terms without any part-text\nannotations. By concatenating the feature vectors of each visual region as the visual representations\nof an image, Zhu et al. [33] proposed a simple generative method to generate synthesised visual\nfeatures using the text descriptions about an unseen class. 
Inspired by the effectiveness of region\n\n2\n\n(a) Original image(b) Heatmap of the global features(c) Heatmap of the region features\fFigure 2: The framework of the proposed S2GA approach. The green box denotes the image\nfeaturization part; the blue box is two-layer Semantics-Guided Attention (SGA) part to distribute\ndifferent weights to different relevant visual features based on the class semantic descriptions; the red\nbox denotes the visual-semantic matching part to jointly embed both the class semantic features and\nthe integrated features of both the global and local weighted features into a multi-class classi\ufb01cation\nframework.\n\nfeatures for ZSL, we also use the region features in this work. Different from [1, 6, 33], we argue\nthat important regions should contribute more to the prediction and design an attention method to\ndistribute different weights for different regions according to their relevance with class semantic\nfeatures, and integrate both the global visual features and the weighted region features into more\nsemantics-relevant features to represent images.\n\n2.2 Attention Model\n\nThe aim of the attention mechanisms is to either highlight important local information or alleviate\nthe issue of irrelevant or noisy information brought by the global features. Due to their validity and\ngenerality, the attention mechanisms have been widely adopted in various computer vision tasks,\ne.g., object classi\ufb01cation [4], machine translation [9], image caption [13, 27], and visual question\nanswering [24, 28].\nTo the best of our knowledge, there is no work to apply the attention mechanism to ZSL task. In\nthis work, we design a stacked attention network to assign different importance weights to features\nof different local regions to obtain a more semantics-relevant feature representation. One of the\nmost closely related work, Yang et al. 
[26] proposed a stacked attention network for visual question\nanswering, which directly uses the question to search for the related regions to the answer. However,\nfor ZSL task, since no corresponding class semantic descriptions are provided for testing images, the\ncurrent attention mechanism could not be applied to it directly. To this end, we propose to indirectly\nlearn the attention maps to weight different regions guided by the class semantic descriptions during\ntraining. This is the novelty of the proposed attention method. Besides, we also integrate both the\nweighted region features and the global features to represent image features since the corresponding\nclass semantic descriptions contain both global and local information. In summary, our stacking\nbased attention learning enables learning a hierarchical representation of the attention from both the\nglobal and local features, which was ignored by the existing studies in the attention learning literature\nand is also signi\ufb01cantly different from the method at [23] that does not use stacking mechanism.\n\n3 Semantics-guided Attention Networks\n\nThe visual-semantic matching that measures similarities between the visual and class semantic\nfeatures is a key to address ZSL. However, the visual features extracted from the original images\nand the class semantic features are located in different structural spaces, a simple matching method\nmay not align the semantics well. To narrow the semantic gap between the visual and the class\nsemantic modalities, we propose a semantic-guided attention approach to use the class semantic\ndescriptions to guide the local region features to obtain more semantic-relevant visual features for\nthe subsequent visual-semantic matching. 
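Concretely, a single semantics-guided weighting step of this kind can be sketched as follows. This is a minimal NumPy illustration with made-up dimensions and random weights; the two-layer networks of the actual model are collapsed into single layers here, and `W_f`, `W_g`, `w_p` are stand-in parameters, not the paper's trained ones:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
p, m, d = 512, 7, 128                       # region feature dim, number of regions, latent dim

V = rng.standard_normal((p, m))             # local region features V_I, one column per region
v_g = V.mean(axis=1)                        # fused global vector: uniform average of the regions

W_f = rng.standard_normal((d, p)) * 0.01    # local embedding network (collapsed to one layer)
W_g = rng.standard_normal((d, p)) * 0.01    # semantic-guided network (collapsed to one layer)
w_p = rng.standard_normal(d) * 0.01         # attention projection
b_p = 0.0

f_V = np.maximum(W_f @ V, 0)                # ReLU embedding of each region, shape (d, m)
g_v = np.maximum(W_g @ v_g, 0)              # ReLU embedding of the global vector, shape (d,)
h = np.tanh(f_V * g_v[:, None])             # fuse: multiply each column by the guidance vector
att = softmax(w_p @ h + b_p)                # one attention weight per region, sums to 1

U = V * att + V                             # refined regions: u_i = p_i * v_i + v_i
```

In the full model the guidance network additionally pins its mid-layer output to the class semantic vector during training, which is what makes the attention semantics-guided.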
The semantic-guided attention mechanism pinpoints the regions that are highly relevant to the class semantic descriptions and filters out the noisy regions. The overall architecture of the proposed semantics-guided attention networks is shown in Fig. 2. In this section, we first describe the proposed S2GA model and the visual-semantic matching model, and then apply them to fine-grained ZSL.\n\n3.1 Stacked semantics-guided attention networks\n\nGiven the local features of an image and its corresponding class semantic vector, the attention networks distribute different weights to each visual region vector via multi-step attention layers, and integrate both the global and weighted local features to obtain more semantics-relevant representations for images. Specifically, we propose an attention approach that uses the local region features to gradually filter out noise and weight the regions that are highly relevant to the class semantic descriptions via multiple (stacked) attention layers. Each attention map measures the relevance between the region features and the class semantic features under the guidance of the class semantic features, so we call our approach the stacked semantics-guided attention (S2GA) model.\nFig. 3 shows the illustration of a semantics-guided attention layer. Given the image local feature representations VI and the corresponding class semantic vector s, the attention map is obtained with two separate networks. The first network is denoted as the local embedding network, which feeds the local feature representations into a latent space through a simple two-layer neural network. The second network is named the semantic-guided network. 
It first compresses all the visual region features into an integrated vector vG and then feeds it into the same latent space with a three-layer neural network. Specifically, the output of the mid-layer is forced to be close to the corresponding class semantic feature. In this way, the class semantic information is embedded into the network, which guides the generation of the attention map. Then, a softmax function is used to generate the attention distribution over the regions of the image:\n\nFigure 3: An illustration of a semantics-guided attention (SGA) layer.\n\nhA = tanh(f(VI) \u2295 g(vG)),   (1)\npI = softmax(WP hA + bP),   (2)\n\nwhere VI \u2208 Rp\u00d7m, p is the feature dimensionality of each region and m is the number of image regions; vG \u2208 Rp is the fused visual image vector, i.e., the uniform average of all the image region vectors in VI; \u2295 denotes the multiplication between each column of the matrix and the vector, performed by element-wise multiplying each column of the matrix by the vector; hA \u2208 Rd\u00d7m is the fused representation in the latent space; pI \u2208 Rm is an m-dimensional vector that gives the attention probability of each image region; and f and g are two different networks:\n\nf(vI) = h(WI,A vI),   (3)\ng(vG) = h(WG,A h(WG,S vG)),   (4)\n\nwhere h is a nonlinear function (we use the Rectified Linear Unit (ReLU) in the experiments); WI,A \u2208 Rd\u00d7p, WG,S \u2208 Rq\u00d7p, and WG,A \u2208 Rd\u00d7q are the learned parameters; and q and d are the dimensionalities of the class semantic space and the latent space, respectively. To embed the class semantic information into the attention network, the output of the second layer of g is forced to be close to the corresponding class semantic features, which is formulated as:\n\nmin LossG = ||h(WG,S vG) \u2212 s||^2.   (5)\n\nBased on the attention distribution, we obtain the weighted feature vector of each image region, \u02dcvi = pi vi, where vi is the feature vector of the i-th region. We then combine the weighted vector \u02dcvi with the previous region feature vector to form a refined region vector ui = \u02dcvi + vi. ui is regarded as a refined vector since it encodes both the visual information and the class semantic information.\n\nCompared with the approaches that simply use the global image vector, the attention method constructs a more informative u, since higher weights are put on the visual regions that are more relevant to the class semantic descriptions. However, for some complicated cases, a single attention network is not sufficient to locate the correct regions for the class semantic descriptions. Therefore, we iterate the above semantics-guided attention process using multiple attention layers, each extracting more fine-grained visual attention information for the class semantic descriptions. Specifically, for the k-th attention layer, we compute:\n\nh^k_A = tanh(f(U^k_I) \u2295 g(u^k_G)),   (6)\np^k_I = softmax(W^k_P h^k_A + b^k_P),   (7)\n\nwhere U^0_I and u^0_G are initialized to be VI and vG, respectively. Then the weighted image region vector is added to the previous image region feature to form a new image vector:\n\n\u02dcu^k_i = p^k_i u^k_i,   (8)\nu^k_i = \u02dcu^k_i + u^k_i.   (9)\n\nIt should be noted that in every attention layer the class semantic information is embedded into the network. We repeat this process K times to obtain u^K_G. By integrating all the region features, we obtain the final image representation uG for the embedding network.\n\n3.2 Visual-semantic matching model\n\nTo connect the visual features and the class semantic features, we use a two-layer network to embed the class semantic feature into the visual space:\n\nvs = h(WE s + bE),   (10)\n\nwhere WE \u2208 Rp\u00d7q and bE \u2208 Rp are the embedding matrix and bias, and h denotes the ReLU function. To align the common semantic information between the visual space and the class semantic space, the difference between the attention visual feature uG and its corresponding embedding vs of the class semantic descriptions should be small, and the objective function is formulated as a squared loss:\n\nmin LossA = ||vs \u2212 uG||^2.   (11)\n\nIt should be noted that the contrastive loss is also an alternative to align the visual-semantic interactions, which is specifically formulated as:\n\nmin LossA = max{0, m + d(vs; uG) \u2212 d(v^neg_s; uG)},   (12)\n\nwhere m is a margin which we fix to 0.1 in the experiments; v^neg_s is the synthesized visual feature from the other class semantic features; and d(i; j) is a metric function that measures the distance between vectors i and j. While in our experiments we use the squared Euclidean distance d(i; j) = ||i \u2212 j||^2, any other differentiable distance function can also be used here. Considering that the approach with the squared loss is more efficient, if not specified, S2GA refers to the approach with the squared loss.\n\nA multi-class classifier with softmax activation, which is trained on the final visual attention features as well as the class semantic features, predicts the class label of the input image. The predicted label is the index with the maximum probability (Eq. 13), where VS is the embedding matrix of the collection of all the seen class semantic features, which is obtained in Eq. 
10.\n\nc\u2217 = arg max_c pc, s.t. pc = softmax(VS uG),   (13)\n\n3.3 Apply to ZSL\n\nGiven a set of semantic features ST of candidate classes and a test instance vt, ZSL is achieved via three steps. First, the test instance is fed into the attention network to obtain the attention feature ut. Then, the semantic features are embedded into the visual space via Eq. 10 to obtain VT. After that, the classification of the test instance is achieved by simply calculating its distance to the semantic embedding features VT in the visual space:\n\nc\u2217_t = arg min D(ut, VT).   (14)\n\nMethod | F | SI | Performance\nMFMR-joint\u2020 [25] | V | A | 53.6\nDeep-SCoRe [12] | V | A | 59.5\nSynC_struct [5] | G | A | 54.4\nDEM [32] | G | A | 58.3\nESZSL [15] | G | A | 53.1\nRelation-Net [19] | G | A | 62.0\nSJE [3] | G | A/W | 55.3/28.4\nMCZSL [1] | GTA | A/W | 56.5/32.1\nS2GA | GTA | A/W | 75.3/46.9\nS2GA-CL | GTA | A/W | 76.8/48.2\n\nTable 1: Performance evaluation on CUB in classification accuracy (%). S2GA-CL is the approach with the contrastive loss; F and SI denote the visual features and the semantic information used; V and G are short for VGGNet and GoogleNet feature representations; A and W are short for attribute space and Word2Vec space, respectively. MFMR-joint\u2020 is a transductive approach in which the testing instances are available in the training stage.\n\n4 Experiments\n\nIn this section, we carry out several experiments to evaluate the proposed S2GA networks on both zero-shot classification and zero-shot retrieval tasks.\n\n4.1 Experimental setup\n\nDatasets: Following [6], we conduct experiments on two fine-grained bird datasets, CUB [22] and NABirds [21]. Specifically, the CUB dataset contains 200 categories of bird species with a total of 11,788 images. Each category is annotated with 312 attributes. Besides, the local regions of the bird in each image are annotated with locations by experts. 
Thus, we can either directly extract the local image features using the ground-truth location annotations, or indirectly extract them using the SPDA-CNN framework [31] to detect the important regions, followed by a sub-network that uses a 3\u00d73 ROI to pool each region into a 512-d feature. We call the features based on these two strategies \u201cGTA\u201d and \u201cDET\u201d, respectively. Specifically, we extract the features of 7 local regions to represent each CUB image; the dimensionality of each region feature is 512. These 7 regions are \u201chead\u201d, \u201cback\u201d, \u201cbelly\u201d, \u201cbreast\u201d, \u201cleg\u201d, \u201cwing\u201d, and \u201ctail\u201d. Considering that class semantic descriptions are easily available for the CUB dataset, we conduct experiments using the class-level attributes, Word2Vec [11], and Term Frequency-Inverse Document Frequency (TF-IDF) [16] feature vectors as class semantic features, respectively. The dimensionalities of the attribute, Word2Vec, and TF-IDF features are 312, 400, and 11,083, respectively. Compared with the CUB dataset, the NABirds dataset is larger: it consists of 1,011 classes with 48,562 images. As in [6], we obtain the final 404 classes after merging the leaf node classes into their parents. The semantic description of each category is an article collected from Wikipedia, from which a TF-IDF [16] feature vector is extracted to represent the class semantic features. The dimensionality of TF-IDF for the NABird dataset is 13,585. Since no \u201cleg\u201d annotations are available for the NABird dataset, we extract the features of the remaining six visual regions to represent the local visual features. In order to improve the training efficiency, we use Principal Component Analysis (PCA) to reduce the TF-IDF dimensionality of the CUB and NABird datasets to 200 and 400, respectively. 
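The paper does not state which PCA implementation is used; as an illustration, the reduction can be sketched with a NumPy-only PCA via SVD (the TF-IDF matrix below is random stand-in data of the CUB shape):

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                  # center each feature dimension
    # economy-size SVD: the principal axes are the rows of Vt
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                     # shape (n_classes, k)

rng = np.random.default_rng(0)
tfidf = rng.random((200, 11083))             # stand-in for 200 CUB class TF-IDF vectors
reduced = pca_reduce(tfidf, 200)
print(reduced.shape)                         # (200, 200)
```

For the NABird features, the same routine would be called with k=400.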
All the class semantic features are scaled into [0, 1] with standard normalization.\nImplementation Details: In our system, the dimensionality d of the hidden layer and the batch size are set to 128 and 512, respectively. We directly optimize the unweighted sum of the three objective functions, as we found empirically that adding weights did not improve the performances. The whole architecture is implemented in TensorFlow and trained end-to-end with fixed local visual features 2. For optimization, the RMSProp method is used with a base learning rate of 10^\u22124. The architecture is trained for up to 3,000 iterations, stopping early if the validation error has not improved in the last 30 iterations.\n\n2https://github.com/ylytju/sga\n\nMethod | CUB SCS | CUB SCE | NABird SCS | NABird SCE\nESZSL [15] | 28.5 | 7.4 | 24.3 | 6.3\nZSLNS [14] | 29.1 | 7.3 | 24.5 | 6.8\nSynC_fast [5] | 28.0 | 8.6 | 18.4 | 3.8\nZSLPP [6] | 37.2 | 9.7 | 30.3 | 8.1\nGAA [33] | 43.7 | 10.3 | 35.6 | 8.6\nS2GA-DET (Ours) | 42.9 | 10.9 | 39.4 | 9.7\n\nTable 2: The per-class average Top-1 accuracy (in %) on CUB and NABird datasets with two different split settings, using DET as visual representations and TF-IDF as class semantic features. We directly copy the performance results of the competitors from [6] and [33].\n\n4.2 Traditional ZSL\n\n4.2.1 Comparison with state-of-the-art approaches\n\nCompared with the NABird dataset, the CUB dataset is a well-known benchmark for ZSL. Thus, we first conduct experiments on the CUB dataset to compare with the state-of-the-art approaches. For an easy comparison with the existing approaches, we use the same split as in [2], with 150 classes for training and 50 disjoint classes for testing. Eight recently published approaches are selected for comparison. 
Speci\ufb01cally, MCZSL [1] directly uses region annotations to extract image features, and\nthe rest approaches employ two popular CNN architectures to extract image features, i.e., VGGNet\n[17] and GoogleNet [20]. As for the class semantic representations, we use the attributes provided by\nthe CUB dataset and Word2Vec [11] of each class name extracted with the unsupervised language\nprocessing technology. We report the per-class average Top-1 accuracies of different approaches in\nTable 1.\nAlthough performance results in Table 1 vary drastically with different visual representations, it is\nclear that the proposed approach outperforms all the existing methods both with attribute annotations\nand Word2Vec. Speci\ufb01cally, the proposed S2GA approach using GTA achieves impressive gains\nagainst the state-of-the-art approaches for both semantic representations: 13.3% on the attribute and\n14.8% on Word2Vec, respectively. Besides, we observe that S2GA-CL performs slightly better than\nS2GA, which indicates that the contrastive loss may capture more discriminative information.\nTo compare with the existing methods more fairly, we select \ufb01ve state-of-the-art approaches that use\nboth the same visual representations and class semantic representations on both CUB and NABird\ndatasets. All the competitors use the same settings and the features as ours. Speci\ufb01cally, we use the\nVGGNet features based on the detected regions to represent visual image and TF-IDF to represent\nclass semantic features. All the region vectors are concatenated in order into a long vector to represent\neach image. Those features are provided in [6], which are obtained in 3. Following [6, 33], we\nevaluate the approaches on two split settings: Super-Category-Shared (SCS) and Super-Category-\nExclusive (SCE). These two different splits differ in how close the seen classes are related to the\nunseen ones. 
Specifically, in the SCS-split setting, there exist one or more seen classes belonging to the same parent category as each unseen class. This is the traditional ZSL split setting on the CUB dataset. On the contrary, in the SCE-split setting, the parent categories of the unseen classes are exclusive to those of the seen classes. Compared with the SCS-split, the relevance between the seen and unseen classes of the SCE-split is minimized, which brings more challenges for knowledge transfer.\nTable 2 shows the classification performances of each split setting on both CUB and NABird datasets. From the results, we observe that the proposed S2GA approach beats all the competitors on both datasets under the two split settings, except for being a slight 0.8% worse than GAA [33] under the SCS split setting on the CUB dataset. Specifically, it obtains 3.8% and 1.1% improvements over the current best approach [33] on the NABird dataset under the SCS and SCE split settings, respectively. Besides, we observe that the classification accuracies of SCE are dramatically smaller than those of SCS, which indicates that the relatedness between the seen and unseen classes strongly impacts the classification performance. This is a reasonable phenomenon, since knowledge is easily transferred if some related classes or parent classes of the unseen classes are available for training.\n\n3https://github.com/EthanZhu90/ZSL_PP_CVPR17\n\nMethod | CUB A | CUB W | CUB T | NABird T\nbaseline | 62.5 | 38.4 | 38.6 | 31.6\none-attention layer | 67.1 | 40.3 | 41.8 | 36.2\ntwo-attention layer | 68.9 | 41.8 | 42.9 | 39.4\nthree-attention layer | 68.5 | 41.6 | 42.7 | 39.6\n\nTable 3: ZSL performances (in %) of the proposed approach with different numbers of attention layers on CUB and NABird datasets. A - attribute, W - Word2Vec, T - TF-IDF. \u201cbaseline\u201d is the method without the attention mechanism.\n\nMethod | CUB 50% | CUB 100% | NABird 50% | NABird 100%\nESZSL [15] | 27.3 | 22.7 | 27.8 | 20.9\nZSLNS [14] | 29.5 | 23.9 | 27.3 | 22.1\nZSLPP [6] | 42.0 | 36.3 | 35.7 | 31.3\nGAA [33] | 48.3 | 40.3 | 37.8 | 31.0\nS2GA (Ours) | 47.1 | 42.6 | 42.2 | 36.6\n\nTable 4: Zero-shot retrieval mAP (in %) comparison on CUB and NABird datasets. The results of all the competitors are cited from [33]. All the competitors use the same features.\n\n4.2.2 Gains of the attention mechanism\n\nTo evaluate the effectiveness of the proposed attention mechanism, we conduct experiments on both CUB and NABird datasets using local features based on detected regions as visual representations under the SCS split setting. We evaluate the method without the attention mechanism and with different numbers of attention layers. We call the method without any attention layer the baseline. The comparison results are shown in Table 3.\nFrom the results, we find that the methods with the attention mechanism perform much better than the one without it on both CUB and NABird datasets, which verifies the effectiveness of the attention mechanism. Besides, the stacked attention mechanism can further improve the performances. However, when the number of attention layers is larger than two, the performances tend to plateau, possibly because there is no further margin of improvement.\n\n4.3 Zero-Shot Retrieval\n\nThe task of zero-shot retrieval is to retrieve the relevant images from the unseen class set given the specified class semantic descriptions of unseen classes. Here we use the well-trained model above to embed both all the images of the unseen classes and the class semantic descriptions into the integrated feature space spanned by both the global features and the weighted local features, where the semantic similarities between the visual and class semantic representations are obtained. 
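As a rough illustration of this retrieval protocol, the ranking step can be sketched in NumPy as follows. The features are synthetic, and precision at a 50% cutoff is used as a simplified stand-in for the full average-precision computation reported in Table 4:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n_img, n_cls = 512, 300, 10
labels = rng.integers(0, n_cls, size=n_img)      # ground-truth unseen-class labels
feats = rng.standard_normal((n_img, p))          # attention features of unseen-class images
class_emb = rng.standard_normal((n_cls, p))      # embedded class semantic descriptions

def retrieval_precision(query_class, ratio=0.5):
    """Rank all images by distance to the embedded query class and
    report precision over the top `ratio * class_size` results."""
    dists = np.linalg.norm(feats - class_emb[query_class], axis=1)
    k = max(1, int(ratio * (labels == query_class).sum()))
    topk = np.argsort(dists)[:k]                 # indices of the k nearest images
    return float((labels[topk] == query_class).mean())

precisions = [retrieval_precision(c) for c in range(n_cls)]
mean_precision = sum(precisions) / len(precisions)
```

With trained features in place of the random ones, averaging per-query average precision instead of a fixed-cutoff precision would recover the mAP protocol of the paper.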
To compare fairly with the competitors, we use both the same feature representations (visual/textual) and the same settings as in [33], where 50% and 100% of the images of each class from the whole dataset are ranked based on their final semantic similarity scores.\nTable 4 presents the comparison results of different approaches in mean average precision (mAP) on CUB and NABird datasets. Note that, except for the 50% setting on the CUB dataset, the proposed approach beats all the competitors. We argue that it benefits both from the powerful final feature representations based on the proposed S2GA mechanism and from the efficient alignment between the visual modality and the class semantic modality.\nWe also visualize some qualitative results of our approach on the two datasets, shown in Fig. 4. We observe that the retrieval performances of different classes vary substantially. For the classes with good performances, the intra-class variations are subtle. Meanwhile, for the classes with worse performances, the inter-class variations are small. For example, the top-6 retrieved images of class \u201cIndigo Bunting\u201d are all from its ground-truth class, since their visual features are similar. However, the query \u201cBlack-billed Cuckoo\u201d retrieves some instances from its closely related class \u201cYellow-billed Cuckoo\u201d, since their visual features are too similar to distinguish.\n\nFigure 4: Zero-shot retrieval visualization samples with our approach. The first two rows are classes from the CUB dataset, and the rest are classes from the NABird dataset. Correct and incorrect retrieved instances are marked in green and red boxes, respectively.\n\n5 Conclusion\n\nIn this paper, we have proposed a stacked semantics-guided attention approach for fine-grained zero-shot learning. 
It progressively assigns weights to different region features under the guidance of class semantic descriptions, and integrates both the local and global features to obtain semantic-relevant representations for images. The proposed approach is trained end-to-end by feeding the final visual features and class semantic features into a joint multi-class classification network. Experimental results on zero-shot classification and retrieval tasks demonstrate the effectiveness of the proposed approach.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant 61771329, the National Basic Research Program of China (Grant No. 2014CB340403), and the National Natural Science Foundation of China under Grant 61632018. Yunlong Yu also acknowledges the support of the China Scholarship Council. The authors are very grateful for NVIDIA’s support in providing GPUs that made this work possible.

References

[1] Zeynep Akata, Mateusz Malinowski, Mario Fritz, and Bernt Schiele. Multi-cue zero-shot learning with strong supervision. In CVPR, pages 59–68, 2016.

[2] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for image classification. TPAMI, 38(7):1425–1438, 2016.

[3] Zeynep Akata, Scott Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, pages 2927–2936, 2015.

[4] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. In ICLR, 2015.

[5] Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha. Synthesized classifiers for zero-shot learning. In CVPR, pages 5327–5336, 2016.

[6] Mohamed Elhoseiny, Yizhe Zhu, Han Zhang, and Ahmed Elgammal. Link the head to the “beak”: Zero shot learning from noisy text description at part precision.
In CVPR, pages 6288–6297, 2017.

[7] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In NIPS, pages 2121–2129, 2013.

[8] Yanwei Fu, Timothy M Hospedales, Tao Xiang, and Shaogang Gong. Transductive multi-view zero-shot learning. TPAMI, 37(11):2332–2345, 2015.

[9] Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. Google’s multilingual neural machine translation system: Enabling zero-shot translation. TACL, 5:339–351, 2017.

[10] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zero-shot visual object categorization. TPAMI, 36(3):453–465, 2014.

[11] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.

[12] Pedro Morgado and Nuno Vasconcelos. Semantically consistent regularization for zero-shot recognition. In CVPR, 2017.

[13] Marco Pedersoli, Thomas Lucas, Cordelia Schmid, and Jakob Verbeek. Areas of attention for image captioning. In ICCV, pages 1251–1259, 2017.

[14] Ruizhi Qiao, Lingqiao Liu, Chunhua Shen, and Anton van den Hengel. Less is more: Zero-shot learning from online textual documents with noise suppression. In CVPR, pages 2249–2257, 2016.

[15] Bernardino Romera-Paredes and Philip Torr. An embarrassingly simple approach to zero-shot learning. In ICML, pages 2152–2161, 2015.

[16] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.

[17] Karen Simonyan and Andrew Zisserman.
Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[18] Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In NIPS, pages 935–943, 2013.

[19] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.

[20] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, et al. Going deeper with convolutions. In CVPR, pages 1–9, 2015.

[21] Grant Van Horn, Steve Branson, Ryan Farrell, Scott Haber, Jessie Barry, Panos Ipeirotis, Pietro Perona, and Serge Belongie. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In CVPR, pages 595–604, 2015.

[22] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset, 2011.

[23] Peng Wang, Lingqiao Liu, Chunhua Shen, Zi Huang, Anton van den Hengel, and Heng Tao Shen. Multi-attention network for one shot learning. In CVPR, pages 22–25, 2017.

[24] Huijuan Xu and Kate Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In ECCV, pages 451–466, 2016.

[25] Xing Xu, Fumin Shen, Yang Yang, Dongxiang Zhang, Hengtao Shen, and Jingkuan Song. Matrix tri-factorization with manifold regularizations for zero-shot learning. In CVPR, pages 2007–2016, 2017.

[26] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In CVPR, pages 21–29, 2016.

[27] Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention.
In CVPR, pages 4651–4659, 2016.

[28] Dongfei Yu, Jianlong Fu, Tao Mei, and Yong Rui. Multi-level attention networks for visual question answering. In CVPR, pages 4187–4195, 2017.

[29] Yunlong Yu, Zhong Ji, Jichang Guo, and Zhongfei Zhang. Zero-shot learning via latent space encoding. TCYB, (99):1–12, 2018.

[30] Yunlong Yu, Zhong Ji, Xi Li, Jichang Guo, Zhongfei Zhang, Haibin Ling, and Fei Wu. Transductive zero-shot learning with a self-training dictionary approach. TCYB, (99):1–12, 2018.

[31] Han Zhang, Tao Xu, Mohamed Elhoseiny, Xiaolei Huang, Shaoting Zhang, Ahmed Elgammal, and Dimitris Metaxas. SPDA-CNN: Unifying semantic part detection and abstraction for fine-grained recognition. In CVPR, pages 1143–1152, 2016.

[32] Li Zhang, Tao Xiang, Shaogang Gong, et al. Learning a deep embedding model for zero-shot learning. In CVPR, pages 3010–3019, 2017.

[33] Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, and Ahmed Elgammal. A generative adversarial approach for zero-shot learning from noisy texts. In CVPR, 2018.