{"title": "Semantic-Guided Multi-Attention Localization for Zero-Shot Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 14943, "page_last": 14953, "abstract": "Zero-shot learning extends the conventional object classification to the unseen class recognition by introducing semantic representations of classes. Existing approaches predominantly focus on learning the proper mapping function for visual-semantic embedding, while neglecting the effect of learning discriminative visual features. In this paper, we study the significance of the discriminative region localization. We propose a semantic-guided multi-attention localization model, which automatically discovers the most discriminative parts of objects for zero-shot learning without any human annotations. Our model jointly learns cooperative global and local features from the whole object as well as the detected parts to categorize objects based on semantic descriptions. Moreover, with the joint supervision of embedding softmax loss and class-center triplet loss, the model is encouraged to learn features with high inter-class dispersion and intra-class compactness. Through comprehensive experiments on three widely used zero-shot learning benchmarks, we show the efficacy of the multi-attention localization and our proposed approach improves the state-of-the-art results by a considerable margin.", "full_text": "Semantic-Guided Multi-Attention Localization for\n\nZero-Shot Learning\n\nYizhe Zhu\u2217\n\nRutgers University\n\nyizhe.zhu@rutgers.edu,\n\nJianwen Xie\n\nHikvision Research Institute\n\njianwen@ucla.edu\n\nZhiqiang Tang\nRutgers University\n\nzhiqiang.tang@rutgers.edu,\n\nXi Peng\n\nUniversity of Delaware\n\nxipeng@udel.edu\n\nAhmed Elgammal\nRutgers University\n\nelgammal@cs.rutgers.edu\n\nAbstract\n\nZero-shot learning extends the conventional object classi\ufb01cation to the unseen class\nrecognition by introducing semantic representations of classes. 
Existing approaches\npredominantly focus on learning the proper mapping function for visual-semantic\nembedding, while neglecting the effect of learning discriminative visual features. In\nthis paper, we study the signi\ufb01cance of the discriminative region localization. We\npropose a semantic-guided multi-attention localization model, which automatically\ndiscovers the most discriminative parts of objects for zero-shot learning without any\nhuman annotations. Our model jointly learns cooperative global and local features\nfrom the whole object as well as the detected parts to categorize objects based on\nsemantic descriptions. Moreover, with the joint supervision of embedding softmax\nloss and class-center triplet loss, the model is encouraged to learn features with\nhigh inter-class dispersion and intra-class compactness. Through comprehensive\nexperiments on three widely used zero-shot learning benchmarks, we show the\nef\ufb01cacy of the multi-attention localization and our proposed approach improves\nthe state-of-the-art results by a considerable margin.\n\n1\n\nIntroduction\n\nDeep convolutional neural networks have achieved signi\ufb01cant advances in object recognition. The\nmain shortcoming of deep learning methods is the inevitable requirement of large-scale labeled\ntraining data that needs to be collected and annotated by costly human labor [1, 2, 3, 4]. In spite that\nimages of ordinary objects can be readily found, there remains a tremendous number of objects with\ninsuf\ufb01cient and scarce visual data [5]. This attracts many researchers\u2019 interest in how to recognize\nobjects with few or even no training samples, which are known as few-shot learning [6, 7] and\nzero-shot learning [8, 9, 5, 10, 11, 12], respectively.\nZero-shot learning mimics the human ability to recognize objects only from a description in terms of\nconcepts in some semantic vocabulary [13]. 
The underlying key is to learn the association between visual representations and semantic concepts and use it to extend recognition to unseen objects. In a general sense, the typical scheme of the state-of-the-art approaches to zero-shot learning is (1) to extract the feature representation of visual data from CNN models pretrained on a large-scale dataset (e.g., ImageNet), and (2) to learn mapping functions that project the visual features and semantic representations into a shared space. The mapping functions are optimized by either a ridge regression loss [14, 15] or a ranking loss on the compatibility scores of the two mapped features [8, 9].

*Work was done while Yizhe Zhu was an intern at Hikvision Research Institute.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Taking advantage of the success of generative models (e.g., GAN [16], VAE [17], generator network [18]) in data generation, several recent methods [5, 10, 11] resort to hallucinating visual features of unseen classes, converting zero-shot learning into a conventional object recognition problem.

All the aforementioned methods neglect the significance of discriminative visual feature learning. Since the CNN models are pretrained on a traditional object recognition task, the extracted features may not be representative enough for the zero-shot learning task. Especially in fine-grained scenarios, features learned from coarse object recognition can hardly capture the subtle differences between classes. Although several recent works [13, 19] address the problem in an end-to-end manner that is capable of discovering more distinctive visual information suitable for zero-shot recognition, they still simply extract the global visual feature of the whole image, without considering the effect of discriminative part regions in the images.
We argue that there are multiple discriminative part areas that are key to recognizing objects, especially fine-grained objects. For instance, the head and tail are crucial to distinguish bird species. To capture such discriminating regions, we propose a semantic-guided attention localization model to pinpoint where the most significant parts are. Compactness and diversity losses on the multi-attention maps are proposed to encourage each attention map to be compact around its most crucial region while remaining divergent across different attention maps.

We combine the whole image and the multiple discovered regions to provide a richer visual expression, and learn global and local visual features (i.e., image features and region features) for the visual-semantic embedding model, which is trained in an end-to-end fashion. In the zero-shot learning scenario, the embedding softmax loss [13, 19] is used by embedding the class semantic representations into a multi-class classification framework. However, the softmax loss only encourages the inter-class separability of features; the resulting features are not sufficient for recognition tasks [20]. To encourage high intra-class compactness, the class-center triplet loss [21] assigns an adaptive "center" to each class and forces the learned features to be closer to the "center" of the corresponding class than to those of other classes. In this paper, we involve both the embedding softmax loss and the class-center triplet loss as the supervision of feature learning. We argue that these cooperative losses can efficiently enhance the discriminative power of the learned features.

To the best of our knowledge, this is the first work to jointly optimize multi-attention localization with global and local feature learning for zero-shot learning tasks in an end-to-end fashion.
Our main\ncontributions are summarized as follows:\n\n\u2022 We present a weakly-supervised multi-attention localization model for zero-shot recognition,\nwhich jointly discovers the crucial regions and learns feature representation under the\nguidance of semantic descriptions.\n\n\u2022 We propose a multi-attention loss to encourage compact and diverse attention distribution\n\nby applying geometric constraints over attention maps.\n\n\u2022 We jointly learn global and local features under the supervision of embedding softmax\nloss and class-center triplet loss to provide an enhanced visual representation for zero-shot\nrecognition.\n\n\u2022 We conduct extensive experiments and analysis on three zero-shot learning datasets and\ndemonstrate the excellent performance of our proposed method on both part detection and\nzero-shot learning.\n\n2 Related Work\n\nZero-Shot Learning Methods While several early works of zero-shot learning [22] make use of\nthe attribute as the intermediate information to infer the labels of images, the current majority of\nzero-shot learning approaches treat the problem as a visual-semantic embedding one. A bilinear\ncompatibility function between the image space and the attribute space is learned using the ranking\nloss in ALE [8] or the ridge regression loss in ESZSL [23]. Some other zero-shot learning approaches\nlearn non-linear multi-model embeddings; for example, LatEm [9] learns a piecewise linear model by\nthe selection of learned multiple linear mappings. DEM [14] presents a deep zero-shot learning model\nwith non-linear activation ReLU. More related to our work, several end-to-end learning methods\nare proposed to address the pitfall that discriminative feature learning is neglected. SCoRe [13]\ncombines two semantic constraints to supervise attribute prediction and visual-semantic embedding,\nrespectively. 
LDF [19] takes one step further and integrates a zoom network in the model to discover\n\n2\n\n\fFigure 1: The Framework of the proposed Semantic-Guided Multi-Attention localization model (SGMA). The\nmodel takes as input the original image and produces n part attention maps (here n = 2). Multi-Attention loss\nLM A keeps the attention areas compact in each map and divergent across different maps. The part images from\nthe cropping subnet and the original images are fed into different CNNs in the joint feature learning subnet for\nsemantic description-guided object recognition.\n\nsigni\ufb01cant regions automatically, and learn discriminative visual feature representation. However, the\nzoom mechanism can only discover the whole object by cropping out the background with a square\nshape, still being restricted to the global features. In contrast, our multi-attention localization network\ncan help \ufb01nd multiple \ufb01ner part regions (e.g., head, tail) that are discriminative for zero-shot learning.\nMulti-Attention Localization Several previous methods are proposed to leverage the extra anno-\ntations of part bounding boxes to localize signi\ufb01cant regions for \ufb01ne-grained zero-shot recogni-\ntion [24, 25, 5]. [24] straightforwardly extracts the part features by feeding annotated part regions\ninto a CNN pretrained on ImageNet dataset. [25, 5, 26] train a multiple-part detector with groundtruth\nannotations to produce the bounding boxes of parts and learn the part features with conventional\nrecognition tasks. However, the heavy involvement of human labor for part annotations makes tasks\ncostly in the real large-scale problems. Therefore, learning part attentions in a weakly supervised way\nis desirable in the zero-shot recognition scenario. Recently, several attention localization models are\npresented in the \ufb01ne-grained classi\ufb01cation scenario. 
[27, 28] learn a set of part detectors by analyzing filters that consistently respond to specific patterns. Spatial transformer [29] proposes a learnable module that explicitly allows the spatial manipulation of data within the network. In [30, 28, 31], candidate part models are learned from convolutional channel responses. Our work is different in three aspects: (1) we learn part attention models from convolutional channel responses; (2) instead of using the supervision of the classification loss, our model discovers the parts with semantic guidance, making the located parts more discriminative for zero-shot learning; (3) the zero-shot recognition model and the attention localization model are trained jointly to ensure the part localization is optimized for zero-shot object recognition.

3 Method

We start by introducing some notation and the problem definition. Assume there are N labeled instances from C^s seen classes D^s = {(x^s_i, y^s_i, s^s_i)}^N_{i=1} as training data, where x^s_i ∈ X denotes the image, y^s_i ∈ Y^s is the corresponding class label, and s^s_i = φ(y^s_i) ∈ S is the semantic representation of the corresponding class. Given an image x^u_i from an unseen class and a set of semantic representations of unseen classes {s^u_i = φ(y^u_i)}^{C^u}_{i=1}, where C^u denotes the number of unseen classes, the task of zero-shot learning is to predict the class label y^u ∈ Y^u of the image, where Y^s and Y^u are disjoint.

The framework of our approach is demonstrated in Figure 1. It consists of three modules: the multi-attention subnet, the region cropping subnet, and the joint feature embedding subnet. The multi-attention subnet generates multiple attention maps corresponding to distinct parts of the object. The region cropping subnet crops the discriminative parts with differentiable operations.
The joint feature learning subnet takes as input the cropped parts and the original image, and learns the global and local visual features for the final zero-shot recognition.

3.1 Multi-Attention Subnet

LDF [19] presents a cascaded zooming mechanism to gradually localize the object-centric region while cropping out background noise. Different from LDF, our method considers multiple finer discriminative areas, which can provide various richer cues for object recognition. Our approach starts with the multi-attention subnet to produce attention maps.

As shown in Figure 1, the input images first pass through the convolutional network backbone to become feature representations of size H × W × C. The attention weight vectors a_i over channels are obtained for each attended part based on the extracted feature maps. The attention maps are finally produced by the weighted sum of the feature maps over channels with the previously obtained attention weight vectors a_i. To encourage different attention maps to discover different discriminating regions, we design compactness and diversity losses. Details will be shown later.

To be specific, the channel descriptor p, encoding the global spatial information, is first obtained by applying global average pooling to the extracted feature maps. Formally, the features are shrunk along their spatial dimensions H × W.
The c-th element of p is calculated by:

p_c = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} b_c(i, j),   (1)

where b_c is the feature in the c-th channel. To make use of the information in the channel descriptor p, we follow it with stacked fully connected layers to fully capture the channel-wise dependencies of each part. A sigmoid activation σ(·) is then employed to serve as a gating mechanism. Formally, the channel-wise attention weight a_i is obtained by

a_i = σ(W_2 f(W_1 p)),   (2)

where f(·) refers to the ReLU activation function and a_i can be considered as the soft-attention weight of the channels associated with the i-th part. As discovered in [32, 33], each channel of features focuses on a certain pattern or a certain part of the object. Ideally, our model aims to assign high weights (i.e., a^c_i) to channels associated with a certain part i, while giving low weights to channels irrelevant to that part.

The attention map for the i-th part is then generated by the weighted sum of all channels followed by the sigmoid activation:

M_i(x) = σ(Σ_{c=1}^{C} a^c_i f_Conv(x)^c),   (3)

where the superscript c denotes the c-th channel, and C is the number of channels. For brevity, we omit (x) in the rest of the paper. With the sigmoid activation, the attention map plays the role of a gating mechanism as in the soft-attention scheme, which forces the network to focus on the discriminative parts.

Multi-Attention Loss

To enable our model to discover diverse regions over the attention maps, we design the multi-attention loss by applying geometric constraints. The proposed loss consists of two components:

L_MA = Σ_{i=1}^{N_a} [L_CPT(M_i) + λ L_DIV(M_i)],   (4)

where L_CPT(M_i) and L_DIV(M_i) are the compactness and diversity losses respectively, λ is a balance factor, and N_a is the number of attention maps. Ideally, we want the attention map to concentrate around the peak position rather than disperse.
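As a concrete illustration of Eqs. 1-3, the squeeze-and-gate pipeline can be sketched in NumPy. This is a minimal sketch, not the trained model: the feature map, the two FC weight matrices, and the bottleneck width are illustrative stand-ins for the VGG19-based subnet described in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def part_attention_map(feat, W1, W2):
    """Eqs. 1-3: squeeze the feature maps into a channel descriptor p,
    gate it into channel weights a_i, and collapse the channels into a
    single spatial attention map M_i."""
    H, W, C = feat.shape
    p = feat.mean(axis=(0, 1))                  # Eq. 1: global average pooling, shape (C,)
    a = sigmoid(W2 @ np.maximum(W1 @ p, 0.0))   # Eq. 2: FC -> ReLU -> FC -> sigmoid, shape (C,)
    M = sigmoid(np.tensordot(feat, a, axes=([2], [0])))  # Eq. 3: weighted channel sum
    return a, M

# toy example: a 14x14 feature map with 8 channels and random FC weights
rng = np.random.default_rng(0)
feat = rng.standard_normal((14, 14, 8))
W1 = rng.standard_normal((4, 8)) * 0.1   # hypothetical bottleneck of 4 hidden units
W2 = rng.standard_normal((8, 4)) * 0.1
a, M = part_attention_map(feat, W1, W2)
assert a.shape == (8,) and M.shape == (14, 14)
```

Because of the final sigmoid, M stays in (0, 1) and acts as a soft gate over spatial positions, as described above.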
The ideal concentrated attention map for part i is created as a Gaussian blob whose peak sits at the peak activation of the i-th attention map. Let z be a position in the attention map, and Z be the set of all positions. The compactness loss is defined as follows:

L_CPT(M_i) = (1 / |Z|) Σ_{z∈Z} ||m^z_i − m̃^z_i||^2_2,   (5)

where m^z_i and m̃^z_i denote the generated attention map and the ideal concentrated attention map at location z for the i-th part respectively, and |Z| denotes the size of the attention maps. The L2 heatmap regression loss has been widely used in human pose estimation scenarios to localize keypoints [34, 35], but here we use it for a different purpose.

Intuitively, we also expect the attention maps to attend to different discriminative parts. For example, one map attends to the head while another attends to the tail. To fulfill this goal, we design a diversity loss L_DIV to encourage a divergent attention distribution across the different attention maps. Formally, it is formulated as:

L_DIV(M_i) = Σ_{z∈Z} m^z_i max{0, m̂^z − mrg},   (6)

where m̂^z = max_{k≠i} m^z_k represents the maximum of the other attention maps at location z and mrg denotes a margin. The maximum-margin design here makes the loss less sensitive to noise and improves robustness. The motivation of the diversity loss is that when the activation at a particular position in one attention map is high, the loss prefers lower activations of the other attention maps at the same position. From another perspective, L_DIV can be roughly considered as the inner product of two flattened matrices, which measures the similarity of two attention maps.

3.2 Region Cropping Subnet

With the attention maps in hand, the region could be directly cropped with a square centered at the peak value of each attention map.
However, it is hard to optimize such a non-continuous cropping operation with backward propagation. Similar to [36, 19], we design a cropping network to approximate region cropping. Specifically, assuming a square part region for computational efficiency, our cropping network takes as input the attention maps from the multi-attention subnet and outputs three parameters:

[t_x, t_y, t_s] = f_CNet(M_i),   (7)

where f_CNet(·) denotes the cropping network, which consists of two FC layers, t_x, t_y represent the x-axis and y-axis coordinates of the square center respectively, and t_s is the side length of the square. We produce a two-dimensional continuous boxcar mask V(x, y) = V_x · V_y:

V_x = f(x − t_x + 0.5 t_s) − f(x − t_x − 0.5 t_s),
V_y = f(y − t_y + 0.5 t_s) − f(y − t_y − 0.5 t_s),   (8)

where f(x) = 1/(1 + exp(−kx)). The cropped region is obtained by the element-wise multiplication between the original image and the continuous mask, x^part_i = x ⊙ V_i, where i is the index of parts. We further utilize bilinear interpolation to resize the cropped region x^part_i to the same size as the original images. Interested readers are referred to reference [36] for details.

3.3 Joint Feature Learning Subnet

To provide enhanced visual representations of images for zero-shot learning, we jointly learn the global and local visual features given the original image and the part images produced by the region cropping subnet. As shown in Figure 1, the original image and part patches are resized to 224 × 224 and fed into separate CNN backbone networks (with the identical VGG19 architecture).
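Before moving on, the differentiable boxcar mask of Eqs. 7 and 8 in the cropping subnet can be sketched as follows. This is a minimal NumPy sketch: the mask size, center, side length, and the logistic sharpness k are illustrative values, and in the actual model [t_x, t_y, t_s] would come from f_CNet.

```python
import numpy as np

def steep_sigmoid(x, k=10.0):
    z = np.clip(k * x, -60.0, 60.0)  # clip to avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-z))

def boxcar_mask(H, W, tx, ty, ts, k=10.0):
    """Eq. 8: V(x, y) = V_x * V_y, a differentiable approximation of a
    square cropping window centered at (tx, ty) with side length ts."""
    xs = np.arange(W, dtype=float)
    ys = np.arange(H, dtype=float)
    Vx = steep_sigmoid(xs - tx + 0.5 * ts, k) - steep_sigmoid(xs - tx - 0.5 * ts, k)
    Vy = steep_sigmoid(ys - ty + 0.5 * ts, k) - steep_sigmoid(ys - ty - 0.5 * ts, k)
    return Vy[:, None] * Vx[None, :]          # shape (H, W)

V = boxcar_mask(64, 64, tx=32, ty=32, ts=16)
assert V[32, 32] > 0.9   # inside the square: mask is close to 1
assert V[2, 2] < 0.1     # far outside: mask is close to 0
# the cropped part is then the element-wise product x_part = x * V
```

Because every operation is a smooth sigmoid difference, gradients flow back to (t_x, t_y, t_s), which is exactly what the hard square crop lacks.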
The convolution layers are followed by global average pooling to get the visual feature vector θ(x). To learn discriminative features for the zero-shot learning task, we employ two cooperative losses: the embedding softmax loss L_CLS and the class-center triplet loss L_CCT. The former encourages a higher inter-class distinction, while the latter forces the learned features of each class to be concentrated, with a lower intra-class divergence.

Embedding Softmax Loss

Let φ(y) denote the semantic feature. The compatibility score of the multi-modal features is defined as s = θ(x)^T W φ(y), where W is a trainable transform matrix. If the compatibility scores are considered as logits in softmax, the embedding softmax loss can be given by:

L_CLS = −(1/N) Σ_{i=1}^{N} log( exp(s_{y_i}) / Σ_{y_j∈Y^s} exp(s_j) ),   (9)

where s_j = θ(x)^T W φ(y_j), y_j ∈ Y^s, and N is the number of training samples. In order to combine the global and local features without increasing the complexity of the model, we adopt a late fusion strategy. The overall compatibility scores are obtained by summing up the compatibility scores from each CNN and are used to compute the softmax loss. Note that this strategy significantly reduces the number of parameters of the network by discarding the additional dimension reduction layer (i.e., FC layer) after the feature concatenation used in [19]. Formally, we substitute s_j in Eq. 9 with Σ_i s^i_j, where s^i_j = θ_i(x)^T W_i φ(y_j) and i indexes the part images and the original image.

Class-Center Triplet Loss

The class-center triplet loss [21] is originally designed to minimize the intra-class distances of deep visual features in face recognition tasks. In our case, we jointly train the network with the class-center triplet loss to encourage the intra-class compactness of features. Let i, k be the class indices, the loss
Let i, k be the class indices, the loss\nis formulated as:\n\nLCCT = max{0, mrg + ||(cid:98)\u03c6i \u2212 (cid:98)Ci||2\n\n2 \u2212 ||(cid:98)\u03c6i \u2212 (cid:98)Ck||2\n\n(10)\nwhere mrg is the margin, \u03c6i is the mapped visual feature in semantic feature space (i.e., \u03c6i =\n\n\u03b8(x)T Wi), Ci denotes the \u201ccenter\" of each class that are trainable parameters,(cid:98)\u00b7 means L2 normaliza-\n\ntion operation. The normalization operation is involved to make feature points located on the surface\nof a unit hypersphere, leading to the ease of setting the proper margin. Moreover, class-center triplet\nloss exempts the necessity of triple sampling in the naive triplet loss.\nOverall, the proposed SGMA model is trained in an end-to-end manner with the objective:\n\n2}i(cid:54)=k,\n\nLSGM A = LM A + \u03b11LCLS + \u03b12LCCT ,\n\n(11)\n\nwhere the balance factor \u03b11 and \u03b12 are consistently set to 1 in all experiments.\n\n3.4\n\nInference from SGMA Model\n\nWe provide two ways to infer the labels of unseen class images from the SGMA model. The \ufb01rst one\nis straightforwardly to choose the class label with the maximal overall compatibility score, as the\ngreen path in Figure 1. An alternative way is utilizing the features \u03c6cct(x) learned in the class-center\nbranch, as the purple path in Figure 1. The class label can be inferred based on the similarities\nbetween the feature of the test image \u03c6cct(x) and the prototypes of unseen classes \u03a6u\ncct, which can\nbe obtained by the following steps. We assume the semantic descriptions of unseen classes can be\nrepresented by a linear combination of those of seen classes. 
Let W be the weight matrix of this linear combination; W can be obtained by solving the ridge regression:

W = arg min_W ||Φ^u − W Φ^s||^2_2 + λ ||W||^2_2,   (12)

where Φ^u and Φ^s are the semantic matrices of the unseen and seen classes, with each row being the semantic vector of a class. Equipped with the learned W describing the relationship between the semantic vectors of seen and unseen classes, we obtain the prototypes for unseen classes by applying the same W: Φ^u_cct = W Φ^s_cct, where Φ^s_cct contains the prototypes of the seen classes, obtained by averaging the features of all images of each class.

To combine the global and local descriptions of images, we concatenate the visual features generated by the different CNNs. Moreover, to combine the two ways of inference, the compatibility scores from the embedding softmax branch and the similarity scores from the class-center triplet branch are added as the final prediction scores of the test image w.r.t. the unseen classes:

y = arg max_{y∈Y^u} (s_y + β ⟨φ_cct(x), [Φ^u_cct]_y⟩),   (13)

where s_y = θ(x)^T W φ(y), ⟨·,·⟩ denotes the inner product, [·]_y denotes the row of the matrix corresponding to class y, and β is a balancing factor controlling the contribution of the class-center branch.

4 Experiment

To evaluate the empirical performance of our proposed approach, we conduct experiments on three standard zero-shot learning datasets and compare our method with the state-of-the-art ones. We then show the performance of multi-attention localization. In our experiments, we only use two attention maps, as we find that more maps cause severe overlap among attended regions and hardly improve the zero-shot learning performance.

Figure 2: Part detection results on three benchmarks. Each row displays three examples of results.
Each result consists of three images: the detected parts are marked with blue and red bounding boxes in the first image, and the remaining two images are the corresponding generated attention maps.

4.1 Implementation Details and Model Initialization

We implement our approach in the PyTorch framework. For the multi-attention subnet, we take images of size 448 × 448 as input in order to obtain high-resolution attention maps. For the joint feature embedding subnet, we resize all input images to 224 × 224. We consistently adopt VGG19 as the backbone and train the model with a batch size of 32 on two GPUs (TitanX). We use the SGD optimizer with a learning rate of 0.05, momentum of 0.9, and weight decay of 5 × 10^−4 to optimize the objective functions. The learning rate is decayed by a factor of 0.1 on plateau, and the minimum learning rate is set to 5 × 10^−4. Hyper-parameters in our models are obtained by grid search on the validation set. The margins mrg in Eq. 6 and Eq. 10 are set to 0.2 and 0.8, respectively, and k in Eq. 8 is set to 10. The number of parts is set to 2, since we find that increasing the number of parts yields little improvement in zero-shot learning performance and leads to attention redundancy, i.e., maps attending to the same region.

For the multi-attention subnet, we apply unsupervised k-means clustering to group channels based on their peak activation positions and initialize a_i with the pseudo labels generated by the clustering. Interested readers are referred to reference [36] for details. The attention maps from the initialized multi-attention subnet are leveraged to pretrain the region cropping subnet. Specifically, we obtain the attended region in each attention map as a discriminative square centered at the peak response of the attention map ([p_x, p_y]). The side length of the square, t_s, is assumed to be a quarter of the image size.
The coordinates of the attended region ([px, py, ts]) are considered as pseudo ground truths to\npretrain the cropping subnet with MSE loss, and the attended regions are utilized as the cropped parts\nto pretrain the joint feature learning subnet.\n\n4.2 Datasets and Experiment Settings\n\nWe use three widely used zero-shot learning datasets: Caltech-UCSD-Birds 200-2011 (CUB) [37],\nOxford Flowers (FLO) [38], Animals with Attributes (AwA) [22]. CUB is a \ufb01ne-grained dataset\nof bird species, containing 11,788 images from 200 different species and 312 attributes. FLO is a\n\ufb01ne-grained dataset, consisting of 8,189 images from 102 different types of \ufb02owers without attribute\nannotations. However, the visual descriptions are available and collected by [39]. Finally, AwA is a\ncoarse-grained dataset with 30,475 images, 50 classes of animals, and 85 attributes.\nTo fairly compare with baselines, we use the attributes or sentence features provided by [40, 10]\nas semantic features for all methods. For non-end-to-end methods, we consistently use 2,048-\ndimensional features extracted from a pretrained 101-layer ResNet provided by [40], and for end-to-\nend methods, we adopt VGG19 as the backbone network. Besides, [40] points out that several test\nclasses in the standard splitting (marked as SS) of zero-shot learning setting are utilized for training\nthe feature extraction network, which violates the spirit of zero-shot that test classes should never be\nseen before. Therefore, we also evaluate methods on the splitting proposed by [10] (marked as PS).\nWe measure the quantitative performance of the methods in terms of Mean Class Accuracy (MCA).\n\n4.3 Part Detection Results\n\nTo evaluate the ef\ufb01cacy of weakly supervised part detection, we compare our detection results on\nCUB with SPDA-CNN [41], a state-of-the-art work on part detectors trained with ground truth\npart annotations. 
We observe that our model consistently attends to the head or the tail in the two attention maps, respectively. Therefore, we compare the detected parts with the head or tail ground truth annotations. A part detection is considered correct if it has at least 0.5 overlap with the ground truth (i.e., IoU > 0.5).

Table 1: Part detection results measured by average precision (%).

Method       | Head | Tail | Average
SPDA-CNN     | 90.9 | 67.2 | 79.1
Ours         | 74.9 | 48.1 | 61.5
Ours w/o MA  | 65.7 | 29.4 | 47.6
Random       | 25.6 | 26.0 | 25.8

As shown in Table 1, SPDA-CNN can be considered the upper bound, since it leverages part annotations to train detectors. We also provide the results of random crops, which serve as a lower bound. Compared with the random crops, our method achieves an improvement of 35.7% on average. Although there is still a gap between our performance and that of SPDA-CNN (61.5% v.s. 79.1%) due to the lack of precise part annotations, the results are promising, since our model is more practical in large-scale real-world tasks where costly annotations are not available. Besides, if we remove the proposed multi-attention loss (marked as "Ours w/o MA"), the performance suffers a significant drop (47.6% v.s. 61.5%), confirming the effect of the multi-attention loss.

We also show qualitative results of part localization in Figure 2. The detected parts are well aligned with semantic parts of the objects. In CUB, the two parts are associated with the head and the legs of the birds, while in AwA the parts are the head and rear body of the animals. In FLO, the stamen and pistil are roughly detected in the red box, while the petal is localized as the other crucial part.

Table 2: Zero-shot learning results on CUB, AWA, FLO benchmarks.
The best scores and second best ones are\nmarked bold and underline respectively.\n\nTail Average\n67.2\n48.1\n29.4\n26.0\n\n79.1\n61.5\n47.6\n25.8\n\nCUB\n\nAWA\n\nFLO\n\n40.4\n48.5\n53.4\n51.0\n\n-\n\n45.6\n41.6\n60.5\n60.9\n\n-\n\n65.9\n\nMethod\nLATEM (2016)\nALE (2015)\nSJE (2015)\nESZSL (2015)\nSYNC (2016)\nSAE (2017)\nDEM (2017)\nGAZSL (2018)\nSCoRe (2017)\nLDF (2018)\nOurs\n\nSS\n49.4\n53.2\n55.3\n55.1\n54.1\n33.4\n51.8\n57.5\n59.5\n67.1\n70.5\n\nPS\n49.3\n54.9\n53.9\n53.9\n55.6\n33.3\n51.7\n55.8\n62.7\n67.5\n71.0\n\nSS\n74.8\n78.6\n76.7\n74.7\n72.2\n80.6\n80.3\n77.1\n82.8\n83.4\n83.5\n\nPS\n55.1\n59.9\n65.6\n58.2\n54.0\n53.0\n65.7\n63.7\n61.6\n65.5\n68.8\n\n4.4 Zero-Shot Classi\ufb01cation Results\n\nWe compare our method with two groups of state-of-the-art methods: non-end-to-end methods that\nuse visual features extracted from pretrained CNN, and end-to-end methods that jointly train CNN\nand visual-sementic embedding network. The former group includes LATEM [9], ALE [8], SJE [42],\nESZSL [23], SYNC [43], SAE [15], DEM [14], GAZSL [5], and the latter one includes SCoRe [13],\nLDF [19]. The evaluation results are shown in Table 2. Different groups of approaches are separated\nby a horizontal line. The scores of baselines (DAP-SAE) are obtained from [40, 10]. As the codes of\nDEM, GAZSL, SCoRe are available online, we obtain the results by running the codes on different\nsettings if they are not published. We get all the results of LDF from the authors.\nIn general, we observe that the end-to-end methods outperform the non-end-to-end methods. 
That confirms that jointly training the CNN and the embedding model eliminates the discrepancy, present in non-end-to-end methods, between features learned for conventional object recognition and features needed for zero-shot recognition.
It is worth noting that LDF learns object localization by integrating an additional zoom network into the whole model, while our approach further involves part-level patches that provide local features of objects. Our proposed model consistently outperforms previous approaches, achieving impressive gains over the state of the art on the fine-grained datasets: 3.4% and 3.5% on the CUB SS and PS settings, and 5.0% on FLO. We find that the complexity of our model can be reduced by sharing one CNN between the image and the part patches, but the zero-shot learning performance is slightly degraded; e.g., the scores for CUB-PS and AWA-PS decrease by 3.6% and 2.7%, respectively.

We also evaluate our method in the generalized zero-shot learning setting, where test images come from all classes, both seen and unseen. We report the accuracy of classifying test images from unseen classes and from seen classes into the joint label space, denoted AU and AS respectively, together with their harmonic mean H = 2·AS·AU / (AS + AU).

Table 3: Generalized zero-shot learning results (%).

                  CUB                 AwA
Method        AU    AS    H       AU    AS    H
DEM [14]      19.6  57.9  29.2    32.8  84.7  47.3
GAZSL [5]     31.7  61.3  41.8    29.6  84.2  43.8
LDF [19]      26.4  81.6  39.9     9.8  87.4  17.6
Ours          36.7  71.3  48.5    37.6  87.1  52.5

As shown in Table 3, our model outperforms the previous state-of-the-art methods with respect to the H score. Especially on CUB, where discriminative parts are crucial for capturing the subtle differences among fine-grained classes, our method improves the H score by 6.7% (48.5% vs. 41.8%).

4.5 Ablation Study

In this section, we study the effectiveness of the detected object regions and the finer part patches, as well as the joint supervision of the embedding softmax and class-center triplet losses. Our baseline is the model without part localization, trained with only the embedding softmax loss as the objective.
Effect of discriminative regions. The upper part of Table 4 shows the performance of our method with different image inputs. Our model with only part regions performs worst, because part regions provide only local features of an object, such as features of the head or legs. Although these local features are discriminative at the part level, they miss the information contained in other regions and thus cannot recognize the whole object well on their own. When we combine the original image with the localized parts, performance improves significantly over the baseline, by 5.4% (65.2% vs. 59.8%).
To further demonstrate the effectiveness of the localized parts, we combine the object with randomly cropped parts of the same size. We observe that, in most cases, adding random parts hurts performance. We attribute this to the lack of alignment among randomly cropped parts: a random part may roughly cover the head of the object in one image but the legs in another. In contrast, our localized parts have better semantic alignment, as shown in Figure 2.

Table 4: The performance of variants on zero-shot learning with the PS setting.
The best scores are marked in bold.

Method                  CUB    AWA    FLO    Avg
Baseline                60.2   61.5   57.7   59.8
Parts                   55.4   51.2   49.8   52.1
Baseline+Parts          67.4   64.3   63.9   65.2
Baseline+Random Parts   56.3   59.8   56.4   57.5
Embedding Softmax       60.9   62.4   57.2   60.2
Class-Center Triplet    62.1   64.6   61.1   62.6
Combined                63.5   65.7   61.8   63.7

Effect of joint loss. The bottom part of Table 4 shows the results of different inference strategies when our model is trained with the joint loss as the objective and only the original image as input. Compared with the baseline, the results inferred from the embedding softmax branch improve slightly, since the class-center triplet loss can be considered a regularizer that enhances the discriminativeness of the features. The results inferred from the class-center triplet branch are better still, and we obtain the best results by combining the inferences of the two branches, which improves over the baseline by 3.9%.

5 Conclusion

In this paper, we show the significance of discriminative parts for zero-shot object recognition. This motivates us to design a semantic-guided attention localization model that detects such discriminative parts of objects under the guidance of semantic representations. A multi-attention loss is proposed to favor compact and diverse attentions. Our model jointly learns global and local features from the original image and the discovered parts with the embedding softmax loss and the class-center triplet loss in an end-to-end fashion. Extensive experiments show that the proposed method outperforms the state-of-the-art methods.

Acknowledgments
This work is partially supported by NSF-USA award 1409683.

References

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.

[2] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick.
Microsoft coco: Common objects in context. In ECCV, 2014.\n\n[3] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for\n\nmulti-source domain adaptation. In ICCV, 2019.\n\n[4] Deng-Ping Fan, Wenguan Wang, Ming-Ming Cheng, and Jianbing Shen. Shifting more attention to video\n\nsalient object detection. In CVPR, 2019.\n\n[5] Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, Xi Peng, and Ahmed Elgammal. A generative adversarial\n\napproach for zero-shot learning from noisy texts. In CVPR, 2018.\n\n[6] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot\n\nlearning. In NeurIPS, 2016.\n\n[7] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NeurIPS,\n\n2017.\n\n[8] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for image\n\nclassi\ufb01cation. T-PAMI, 38(7), 2016.\n\n[9] Yongqin Xian, Zeynep Akata, Gaurav Sharma, Quynh Nguyen, Matthias Hein, and Bernt Schiele. Latent\n\nembeddings for zero-shot classi\ufb01cation. In CVPR, 2016.\n\n[10] Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. Feature generating networks for zero-shot\n\nlearning. In CVPR, 2018.\n\n[11] Yizhe Zhu, Jianwen Xie, Bingchen Liu, and Ahmed Elgammal. Learning feature-to-feature translator by\n\nalternating back-propagation for generative zero-shot learning. In ICCV, 2019.\n\n[12] Ziyu Wan, Dongdong Chen, Yan Li, Xingguang Yan, Junge Zhang, Yizhou Yu, and Jing Liao. Transductive\n\nzero-shot learning with visual structure constraint. In NeurIPS, 2019.\n\n[13] Pedro Morgado and Nuno Vasconcelos. Semantically consistent regularization for zero-shot recognition.\n\nIn CVPR, 2017.\n\n[14] Li Zhang, Tao Xiang, and Shaogang Gong. Learning a deep embedding model for zero-shot learning. In\n\nCVPR, 2017.\n\n[15] Elyor Kodirov, Tao Xiang, and Shaogang Gong. Semantic autoencoder for zero-shot learning. 
In CVPR, 2017.

[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.

[17] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.

[18] Tian Han, Yang Lu, Song-Chun Zhu, and Ying Nian Wu. Alternating back-propagation for generator network. In AAAI, 2017.

[19] Yan Li, Junge Zhang, Jianguo Zhang, and Kaiqi Huang. Discriminative learning of latent features for zero-shot recognition. In CVPR, 2018.

[20] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, 2016.

[21] Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille. NormFace: L2 hypersphere embedding for face verification. In ACMMM, 2017.

[22] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zero-shot visual object categorization. T-PAMI, 36(3), 2014.

[23] Bernardino Romera-Paredes and Philip Torr. An embarrassingly simple approach to zero-shot learning. In ICML, 2015.

[24] Zeynep Akata, Mateusz Malinowski, Mario Fritz, and Bernt Schiele. Multi-cue zero-shot learning with strong supervision. In CVPR, 2016.

[25] Mohamed Elhoseiny, Yizhe Zhu, Han Zhang, and Ahmed Elgammal. Link the head to the “beak”: Zero shot learning from noisy text description at part precision. In CVPR, 2017.

[26] Zhong Ji, Yanwei Fu, Jichang Guo, Yanwei Pang, Zhongfei Mark Zhang, et al. Stacked semantics-guided attention model for fine-grained zero-shot learning. In NeurIPS, 2018.

[27] Dequan Wang, Zhiqiang Shen, Jie Shao, Wei Zhang, Xiangyang Xue, and Zheng Zhang. Multiple granularity descriptors for fine-grained categorization. In ICCV, 2015.

[28] Xiaopeng Zhang, Hongkai Xiong, Wengang Zhou, Weiyao Lin, and Qi Tian.
Picking deep \ufb01lter responses\n\nfor \ufb01ne-grained image recognition. In CVPR, 2016.\n\n[29] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In NeurIPS,\n\n2015.\n\n[30] Tianjun Xiao, Yichong Xu, Kuiyuan Yang, Jiaxing Zhang, Yuxin Peng, and Zheng Zhang. The application\nof two-level attention models in deep convolutional neural network for \ufb01ne-grained image classi\ufb01cation.\nIn CVPR, 2015.\n\n[31] Heliang Zheng, Jianlong Fu, Tao Mei, and Jiebo Luo. Learning multi-attention convolutional neural\n\nnetwork for \ufb01ne-grained image recognition. In ICCV, 2017.\n\n[32] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features\n\nfor discriminative localization. In CVPR, 2016.\n\n[33] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and\nDhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV,\n2017.\n\n[34] Adrian Bulat and Georgios Tzimiropoulos. Human pose estimation via convolutional part heatmap\n\nregression. In ECCV, 2016.\n\n[35] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In\n\nCVPR, 2016.\n\n[36] Jianlong Fu, Heliang Zheng, and Tao Mei. Look closer to see better: Recurrent attention convolutional\n\nneural network for \ufb01ne-grained image recognition. In CVPR, 2017.\n\n[37] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset.\n\nTechnical Report CNS-TR-2011-001, California Institute of Technology, 2011.\n\n[38] Maria-Elena Nilsback and Andrew Zisserman. Automated \ufb02ower classi\ufb01cation over a large number of\n\nclasses. In ICVGIP, 2008.\n\n[39] Scott Reed, Zeynep Akata, Bernt Schiele, and Honglak Lee. Learning deep representations of \ufb01ne-grained\n\nvisual descriptions. In CVPR, 2016.\n\n[40] Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. 
Zero-shot learning: A comprehensive evaluation of the good, the bad and the ugly. T-PAMI, 41(9), 2019.

[41] Han Zhang, Tao Xu, Mohamed Elhoseiny, Xiaolei Huang, Shaoting Zhang, Ahmed Elgammal, and Dimitris Metaxas. SPDA-CNN: Unifying semantic part detection and abstraction for fine-grained recognition. In CVPR, 2016.

[42] Zeynep Akata, Scott Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, 2015.

[43] Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha. Synthesized classifiers for zero-shot learning. In CVPR, 2016.