{"title": "Cross Attention Network for Few-shot Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 4003, "page_last": 4014, "abstract": "Few-shot classification aims to recognize unlabeled samples from unseen classes given only few labeled samples. The unseen classes and low-data problem make few-shot classification very challenging. Many existing approaches extracted features from labeled and unlabeled samples independently, as a result, the features are not discriminative enough. In this work, we propose a novel Cross Attention Network to address the challenging problems in few-shot classification. Firstly, Cross Attention Module is introduced to deal with the problem of unseen classes.  The module generates cross attention maps for each pair of class feature and query sample feature so as to highlight the target object regions, making the extracted feature more discriminative. Secondly, a transductive inference algorithm is proposed to alleviate the low-data problem, which iteratively utilizes the unlabeled query set to augment the support set, thereby making the class features more representative. Extensive experiments on two benchmarks show our method is a simple, effective and computationally efficient framework and outperforms the state-of-the-arts.", "full_text": "Cross Attention Network for Few-shot Classi\ufb01cation\n\nRuibing Hou1,2, Hong Chang1,2, Bingpeng Ma2, Shiguang Shan1,2,3, Xilin Chen1,2\n\n1Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS),\n\nInstitute of Computing Technology, CAS, China\n\n2University of Chinese Academy of Sciences, China\n\n3CAS Center for Excellence in Brain Science and Intelligence Technology, China\n\nruibing.hou@vipl.ict.ac.cn, {changhong, sgshan,xlchen}@ict.ac.cn, bpma@ucas.ac.cn\n\nAbstract\n\nFew-shot classi\ufb01cation aims to recognize unlabeled samples from unseen classes\ngiven only few labeled samples. 
The unseen classes and low-data problem make\nfew-shot classification very challenging. Many existing approaches extract features\nfrom labeled and unlabeled samples independently; as a result, the features\nare not discriminative enough. In this work, we propose a novel Cross Attention\nNetwork to address the challenging problems in few-shot classification. Firstly,\na Cross Attention Module is introduced to deal with the problem of unseen classes.\nThe module generates cross attention maps for each pair of class feature and query\nsample feature so as to highlight the target object regions, making the extracted\nfeature more discriminative. Secondly, a transductive inference algorithm is proposed\nto alleviate the low-data problem, which iteratively utilizes the unlabeled query set\nto augment the support set, thereby making the class features more representative.\nExtensive experiments on two benchmarks show that our method is a simple, effective\nand computationally efficient framework and outperforms the state of the art.\n\n1 Introduction\n\nFew-shot classification aims at classifying unlabeled samples (query set) into unseen classes given\nvery few labeled samples (support set). Compared to traditional classification, few-shot classification\nhas two main challenges: one is unseen classes, i.e., the non-overlap between training and test\nclasses; the other is the low-data problem, i.e., very few labeled samples for the test unseen classes.\nSolving the few-shot classification problem requires the model trained with seen classes to generalize\nwell to unseen classes with only few labeled samples. A straightforward approach is fine-tuning a\npre-trained model using the few labeled samples from the unseen classes. However, it may cause severe\noverfitting. Regularization and data augmentation can alleviate but cannot fully solve the overfitting\nproblem. 
Recently, the meta-learning paradigm [38, 39, 22] has been widely used for few-shot learning. In\nmeta-learning, the transferable meta-knowledge, which can be an optimization strategy [31, 1], a\ngood initial condition [7, 16, 24], or a metric space [35, 40, 37], is extracted from a set of training\ntasks and generalizes to new test tasks. The tasks in the training phase usually mimic the settings in\nthe test phase to reduce the gap between training and test settings and enhance the generalization\nability of the model.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1. An example of the class activation maps [48] of training and test images of existing method [35] and\nour method CAN. Warmer colors indicate higher values. Panels: (a) input training image; (b) class activation map\nof existing method; (c) input test image; (d) class activation map of existing method; (e) class activation map of our CAN.\n\nWhile promising, few of these methods pay enough attention to the discriminability of the extracted features.\nThey generally extract features from the support classes and unlabeled query samples independently;\nas a result, the features are not discriminative enough. For one thing, the test images in the\nsupport/query set are from unseen classes, thus their features can hardly attend to the target\nobjects. To be specific, for a test image containing multiple objects, the extracted feature may attend\nto the objects from seen classes, which have a large number of labeled samples in the training set,\nwhile ignoring the target object from the unseen class. As illustrated in Fig. 1 (c) and (d), for the two\nimages from the test class curtain, the extracted features only capture the information of the objects\nthat are related to the training classes, such as person or chair in Fig. 1 (a) and (b). For another,\nthe low-data problem makes the feature of each test class not representative for the true class\ndistribution, as it is obtained from very few labeled support samples. 
In a word, the independent\nfeature representations may fail in few-shot classification.\nIn this work, we propose a novel Cross Attention Network (CAN) to enhance the feature\ndiscriminability for few-shot classification. Firstly, a Cross Attention Module (CAM) is introduced to deal\nwith the unseen class problem. The cross attention idea is inspired by the human few-shot learning\nbehavior. To recognize a sample from an unseen class given a few labeled samples, humans tend to\nfirst locate the most relevant regions in the pair of labeled and unlabeled samples. Similarly, given\na class feature map and a query sample feature map, CAM generates a cross attention map for each\nfeature to highlight the target object. Correlation estimation and meta fusion are adopted to achieve\nthis purpose. In this way, the target object in the test samples can get attention and the features\nweighted by the cross attention maps are more discriminative. As shown in Fig. 1 (e), the extracted\nfeatures with CAM can roughly localize the regions of target object curtain. Secondly, we introduce\na transductive inference algorithm that utilizes the entire unlabeled query set to alleviate the low-data\nproblem. The proposed algorithm iteratively predicts the labels for the query samples, and selects\npseudo-labeled query samples to augment the support set. With more support samples per class, the\nobtained class features can be more representative, thus alleviating the low-data problem.\nExperiments are conducted on multiple benchmark datasets to compare the proposed CAN with\nexisting few-shot meta-learning approaches. Our method achieves new state-of-the-art results on all\ndatasets, which demonstrates the effectiveness of CAN.\n\n2 Related Work\n\nFew-Shot Classification. 
On the basis of the availability of the entire unlabeled query set, few-shot\nclassification can be divided into two categories: inductive and transductive few-shot classification.\nIn this work, we mainly explore the few-shot approaches based on meta-learning.\nInductive Few-shot Learning has been a well studied area in recent years. One promising way is the\nmeta-learning [38, 39, 22] paradigm. It usually trains a meta-learner from a set of tasks, which extracts\nmeta-knowledge to transfer into new tasks with scarce data. Meta-learning approaches for few-shot\nclassification can be roughly categorized into three groups. Optimization-based methods designed the\nmeta-learner as an optimizer that learned to update model parameters [1, 31, 18]. Further, the works\n[7, 33, 36] learned a good initialization so that the learner could rapidly adapt to novel tasks within a\nfew optimization steps. Parameter-generating based methods [2, 20, 21, 3] usually designed the\nmeta-learner as a parameter predicting network. Metric-learning based methods [40, 35, 37, 26, 5] learned\na common feature space where categories can be distinguished from each other based on a distance metric.\nFor example, Matching Network [40] produced a weighted nearest neighbor classifier. Prototypical\nNetwork [35] performed nearest neighbor classification with learned class features (prototypes). The\nworks [37, 26, 15] improved the prototypical network with a learnable similarity metric [37], a task\nadaptive metric [26], or an image-to-class local metric [5].\nOur proposed framework belongs to the metric-learning based methods. Different from existing\nmetric-learning based methods, which extracted the support and query sample features independently, our\nmethod exploits the semantic relevance between support and query features to highlight the target\nobject. 
Although the parameter-generating based methods also consider the relationship between\nsupport and query samples, these approaches require an additional complex parameter prediction\nnetwork. With less overhead, our approach outperforms these methods by a large margin.\nTransductive Algorithm.\nTransductive few-shot classification was first introduced in [17], which\nconstructed a graph on the support set and the entire query set, and propagated labels within the\ngraph. However, the method required a specific architecture, making it less universal. Inspired\nby the self-training strategy in semi-supervised learning [6, 25, 42, 32], we propose a simpler and\nmore general transductive few-shot algorithm, which explicitly augments the labeled support set\nwith unlabeled query samples to achieve more representative class features. Moreover, the proposed\ntransductive algorithm can be directly applied to the existing models, e.g., prototypical network [35],\nmatching network [40], and relation network [37].\nAttention Model. Attention mechanisms aim to highlight important local regions to extract more\ndiscriminative features. They have achieved great success in computer vision applications, such as image\nclassification [12, 41, 27], person re-identification [10, 47, 11], image captioning [29, 44, 4] and visual\nquestion answering [43, 45, 46, 30]. In image classification, SENet [12] proposed a channel attention\nblock to boost the representational power of a network. Woo et al. [41, 27] further integrated the\nchannel and spatial attention modules into a block. In image captioning, the attention blocks [44, 4]\nusually used the last generated words to search for related regions in the image to generate the next\nword. 
And in visual question answering, the attention blocks [43, 45, 46, 30] used the questions to\nlocalize the related regions in the image to answer. Specifically, [30] uses the questions to generate a\nconvolutional kernel which is used to convolve with the image feature. In contrast, our method\nuses a meta-learner to generate a kernel which is used to fuse the relations to get the final attention\nmap. For few-shot image classification, in this paper, we design a meta-learner to compute the cross\nattention between support (or class) and query feature maps, which helps to locate the important\nregions of the target object and enhance the feature discriminability.\n\n3 Cross Attention Module\n\nProblem Definition.\nFew-shot classification usually involves a training set, a support set and a query\nset. The training set contains a large number of classes and labeled samples. The support set of few\nlabeled samples and the query set of unlabeled samples share the same label space, which is disjoint\nwith that of the training set. Few-shot classification aims to classify the unlabeled query samples\ngiven the training set and support set. If the support set consists of C classes and K labeled samples\nper class, the target few-shot problem is called C-way K-shot.\nFollowing [40, 35, 34, 19, 9, 14, 7], we adopt the episode training mechanism, which has been\ndemonstrated as an effective approach for few-shot learning. The episodes used in training simulate\nthe settings in test. Each episode is formed by randomly sampling C classes and K labeled samples\nper class as the support set S = {(x^s_a, y^s_a)}_{a=1}^{n_s} (n_s = C \u00d7 K), and a fraction of the rest samples\nfrom the C classes as the query set Q = {(x^q_b, y^q_b)}_{b=1}^{n_q}. We denote S^k as the support subset of\nthe kth class. 
How to represent each support class S^k and query sample x^q_b and measure the similarity\nbetween them is a key issue for few-shot classification.\nCAM Overview.\nIn this work, we resort to metric-learning to obtain proper feature representations\nfor each pair of support class and query sample. Different from existing methods which extract the\nclass and query features independently, we propose the Cross Attention Module (CAM), which can model\nthe semantic relevance between the class feature and query feature, thus drawing attention to the target\nobjects and benefiting the subsequent matching.\nCAM is illustrated in Fig. 2. The class feature map P^k \u2208 R^{c\u00d7h\u00d7w} is extracted from the support\nsamples in S^k (k \u2208 {1, 2, . . . , C}) and the query feature map Q_b \u2208 R^{c\u00d7h\u00d7w} is extracted from the\nquery sample x^q_b (b \u2208 {1, 2, . . . , n_q}), where c, h and w denote the number of channels, height and\nwidth of the feature maps respectively. CAM generates cross attention map A^p (A^q) for P^k (Q_b),\nwhich is then used to weight the feature map to achieve more discriminative feature representation \u00afP^k_b\n(\u00afQ^b_k). For simplicity, we omit the superscripts and subscripts, and denote the input class and query\nfeature maps as P and Q, and the output class and query feature maps as \u00afP and \u00afQ, respectively.\nCorrelation Layer. As shown in Fig. 2, we first design a correlation layer to calculate a correlation\nmap between P and Q, which is then used to guide the generation of the cross attention maps. To\nthis end, we first reshape P and Q to R^{c\u00d7m}, i.e., P = [p_1, p_2, . . . , p_m] and Q = [q_1, q_2, . . . , q_m],\nwhere m (m = h \u00d7 w) is the number of spatial positions on each feature map. p_i, q_i \u2208 R^c are the\nfeature vectors at the ith spatial position in P and Q respectively. 
The correlation layer computes\nthe semantic relevance between {p_i}_{i=1}^m and {q_i}_{i=1}^m with cosine distance to get the correlation map\nR \u2208 R^{m\u00d7m} as:\n\nR_{ij} = (p_i / ||p_i||_2)^T (q_j / ||q_j||_2), i, j = 1, . . . , m. (1)\n\nFurthermore, we define two correlation maps based on R: the class correlation map R^p \u2250 R^T =\n[r^p_1, r^p_2, . . . , r^p_m] and the query correlation map R^q \u2250 R = [r^q_1, r^q_2, . . . , r^q_m], where r^p_i \u2208 R^m denotes\nthe relevance between the local class feature vector p_i and all query feature vectors {q_i}_{i=1}^m, and\nr^q_i \u2208 R^m is the relevance between local query feature vector q_i and all class feature vectors {p_i}_{i=1}^m.\nIn this way, R^p and R^q characterize the local correlations between the class and query feature maps.\n\nFigure 2. (a) Cross Attention Module (CAM). (b) The Fusion Layer in CAM. In the figure, R^p (R^q) \u2208 R^{m\u00d7m} is\nreshaped to R^{m\u00d7h\u00d7w} for a better visualization. As seen, CAM can generate the feature maps that attend to the\nregions of target object (coated retriever in the figure).\n\nMeta Fusion Layer. A meta fusion layer is then used to generate the class and query attention maps,\nrespectively, based on the corresponding correlation maps. We take the class attention map as an\nexample. As shown in Fig. 2 (b), the fusion layer takes the class correlation map R^p as input, and\napplies a convolutional operation with an m \u00d7 1 kernel, w \u2208 R^{m\u00d71}, to fuse each local correlation vector\n{r^p_i} of R^p into an attention scalar. A softmax function is then used to normalize the attention scalar\nto obtain the class attention at the ith position:\n\nA^p_i = exp((w^T r^p_i) / \u03c4) / \u2211_{j=1}^{h\u00d7w} exp((w^T r^p_j) / \u03c4), (2)\n\nwhere \u03c4 is the temperature hyperparameter. 
Lower temperature leads to lower entropy, making the\ndistribution concentrate on a few high confidence positions. The class attention map is then obtained\nby reshaping A^p into a matrix in R^{h\u00d7w}. Note that the kernel w plays a crucial role in the fusion. It\naggregates the correlations between the local class feature p_i and all local query features {q_j}_{j=1}^m\nas the attention scalar at the ith position. More importantly, the weighted aggregation should draw\nattention to the target object, instead of simply highlighting the visually similar regions across support\nclass and query sample.\nBased on the above analysis, we design a meta-learner to adaptively generate the kernel based on the\ncorrelation between the class and the query features. To this end, we apply global average pooling\n(GAP) operation (i.e., row-wise averaging) to R^p to obtain an averaged query correlation vector,\nwhich is then fed into the meta-learner to generate the kernel w \u2208 R^m:\n\nw = W_2(\u03c3(W_1(GAP(R^p)))), (3)\n\nwhere W_1 \u2208 R^{(m/r)\u00d7m} and W_2 \u2208 R^{m\u00d7(m/r)} are the parameters of the meta-learner, and r is the reduction\nratio. \u03c3 refers to the ReLU function [23]. The nonlinearity in the meta-learning model allows a\nflexible transformation. For each pair of class and query features, the meta-learner is expected to\ngenerate a kernel w which can draw cross attention to the target object. This is achieved in meta\ntraining by minimizing the classification errors on the query samples.\nIn a similar way, we can get the query attention map A^q \u2208 R^{h\u00d7w}. At last, we use a residual attention\nmechanism, where the initial feature maps P and Q are weighted element-wise by 1 + A^p and\n1 + A^q, to form more discriminative feature maps \u00afP \u2208 R^{c\u00d7h\u00d7w} and \u00afQ \u2208 R^{c\u00d7h\u00d7w}, respectively.\nComplexity Analysis. The time and space cost of CAM comes mainly from the correlation layer. 
The time\ncomplexity of CAM is O(h^2 w^2 c) and the space complexity is O(hwc), both of which vary with the\nsize of the input feature map. So we insert CAM after the last convolutional layer to avoid excessive cost.\n\nFigure 3. The framework of the proposed CAN approach.\n\n4 Cross Attention Network\n\nThe overall Cross Attention Network (CAN) is illustrated in Fig. 3, which consists of three modules:\nan embedding module, a cross attention module and a classification module. The embedding module\nE consists of several cascaded convolutional layers, which maps an input image x into a feature map\nE(x) \u2208 R^{c\u00d7h\u00d7w}. Following prototypical network [35], we define the class feature as the mean of\nits support set in the embedding space. As shown in Fig. 3, the embedding module E takes the support\nset S and a query sample x^q_b as inputs, and produces the class feature map\nP^k = (1 / |S^k|) \u2211_{x^s_a \u2208 S^k} E(x^s_a) and a query feature map Q_b = E(x^q_b). 
Each pair of feature maps (P^k and Q_b) is then fed through\nthe cross attention module, which highlights the relevant regions and outputs more discriminative\nfeature pairs (\u00afP^k_b and \u00afQ^b_k) for classification.\nModel Training via Optimization. CAN is trained via minimizing the classification loss on the\nquery samples of training set. The classification module consists of a nearest neighbor and a global\nclassifier. The nearest neighbor classifier classifies the query samples into C support classes based\non a pre-defined similarity measure. To obtain precise attention maps, we constrain each position in\nthe query feature maps to be correctly classified. Specifically, for each local query feature q^b_i at the ith\nposition, the nearest neighbor classifier produces a softmax-like label distribution over C support\nclasses. The probability of predicting q^b_i as the kth class is:\n\np(y = k | q^b_i) = exp(\u2212d((\u00afQ^b_k)_i, GAP(\u00afP^k_b))) / \u2211_{j=1}^{C} exp(\u2212d((\u00afQ^b_j)_i, GAP(\u00afP^j_b))), (4)\n\nwhere (\u00afQ^b_k)_i denotes the feature vector in the ith spatial position of \u00afQ^b_k, and GAP is the global\naverage pooling operation to get the mean class feature. Note that \u00afQ^b_k and \u00afQ^b_j represent the query\nsample x^q_b from somewhat different views as they are correlated with different support classes. In\nEq. 4, the cosine distance d is calculated in the feature space generated by CAM. The nearest neighbor\nclassification loss is then defined as the negative log-probability according to the true class label\ny^q_b \u2208 {1, 2, . . . , C}:\n\nL_1 = \u2212 \u2211_{b=1}^{n_q} \u2211_{i=1}^{m} log p(y = y^q_b | q^b_i). (5)\n\nThe global classifier uses a fully connected layer followed by softmax to classify each query sample\namong all available training classes. Suppose there are overall l classes in the training set. The\nclassification probability vector z^b_i \u2208 R^l for each local query feature q^b_i is computed as\nz^b_i = softmax(W_c (\u00afQ^b_{y^q_b})_i). 
The global classification loss is then expressed as:\n\nL_2 = \u2212 \u2211_{b=1}^{n_q} \u2211_{i=1}^{m} log((z^b_i)_{l^q_b}), (6)\n\nwhere W_c \u2208 R^{l\u00d7c} is the weight of the fully connected layer and l^q_b \u2208 {1, 2, . . . , l} is the true global\nclass of x^q_b. Finally, the overall classification loss is defined as L = \u03bbL_1 + L_2, where \u03bb is the weight\nto balance the effects of different losses. The network can be trained end-to-end by optimizing L\nwith a gradient descent algorithm.\nInductive Inference.\nIn the inductive inference phase, the embedding module is directly used for a\nnovel task to extract the class and query feature maps. Then each pair of class and query feature\nmaps is fed into CAM to get the attention weighted features. Global average pooling is then\nperformed on the outputs of CAM to get the mean class and query features. 
Finally, the label \u02c6y^q_b for a\nquery sample x^q_b is predicted by finding the nearest mean class feature under the cosine distance metric:\n\n\u02c6y^q_b = arg min_k d(GAP(\u00afQ^b_k), GAP(\u00afP^k_b)). (7)\n\nTransductive Inference.\nIn few-shot classification task, each class has very few labeled samples,\nso the class feature can hardly represent the true class distribution. In order to alleviate the problem,\nwe propose a simple and effective transductive inference algorithm which utilizes the unlabeled query\nsamples to enrich the class feature.\nSpecifically, we first utilize the initial class feature map P^k to predict the labels {\u02c6y^q_b}_{b=1}^{n_q} of\nthe unlabeled query samples {x^q_b}_{b=1}^{n_q} using Eq. 7. Then, we define a label confidence criterion\nusing the cosine distance between the query sample x^q_b and its nearest class neighbor:\nc^q_b = min_k d(GAP(\u00afQ^b_k), GAP(\u00afP^k_b)). The lower the value c^q_b, the higher the confidence of the predicted\nlabel \u02c6y^q_b. Based on this criterion, we can obtain a candidate set D = {(x^q_b, \u02c6y^q_b) | s_b = 1, x^q_b \u2208 Q},\nwhere s_b \u2208 {0, 1} denotes the selection indicator for the query sample x^q_b. The selection indicator\ns \u2208 {0, 1}^{n_q} is determined by the top t confident query samples: s = arg min_{||s||_0 = t} \u2211_{b=1}^{n_q} s_b c^q_b.\nFinally, the candidate set D along with the support set S is used to generate a more representative\nclass feature map (P^k)^\u2217:\n\n(P^k)^\u2217 = (1 / (|S^k| + |D^k|)) (\u2211_{x^s_a \u2208 S^k} E(x^s_a) + \u2211_{x^q_b \u2208 D^k} E(x^q_b)). (8)\n\nHere D^k = {(x^q_b, \u02c6y^q_b) | x^q_b \u2208 D, \u02c6y^q_b = k}. (P^k)^\u2217 is then used to re-estimate the pseudo label for each\nquery sample. We repeat the above process for a certain number of iterations, and the number of selected\nsamples in the candidate set D is gradually increased with a fixed ratio in each iteration. 
In this way,\nwe can progressively enrich the class features to be more representative and robust.\n\n5 Experiments\n\n5.1 Experiment Setup\n\nDatasets. We use miniImageNet [40], which is a subset of ILSVRC-12 [13]. It contains 100\nclasses with 600 images per class. We use the standard split following [31, 37, 26, 15, 33]: 64 classes\nfor training, 16 for validation and 20 for testing. We also use the tieredImageNet dataset [32], a much\nlarger subset of ILSVRC-12 [13]. It contains 34 categories and 608 classes in total. These are divided\ninto 20 categories (351 classes) for training, 6 categories (97 classes) for validation, and 8 categories\n(160 classes) for testing, as in [32, 7, 35, 37].\nExperimental setting. We evaluate our approach on 5-way 1-shot and 5-way 5-shot settings.\nFor a C-way K-shot setting, the episode is formed with C classes and each class includes K support\nsamples, and 6 and 15 query samples are used for training and inference respectively. At inference,\n2000 episodes are randomly sampled from the test set. We report the average accuracy and the\ncorresponding 95% confidence interval over the 2000 episodes.\nImplementation details.1\nPyTorch [28] is used to implement all our experiments on one NVIDIA\n1080Ti GPU. Following [26, 17, 36, 21], we use the ResNet-12 network as our embedding module. The\ninput image size is 84 \u00d7 84. During training, we adopt horizontal flip, random crop and random\nerasing [49] as data augmentation. SGD is used as the optimizer. Each mini-batch contains 8 episodes.\nThe model is trained for 80 epochs, with each epoch consisting of 1,200 episodes. 
For miniImageNet,\nthe initial learning rate is 0.1 and decreased to 0.006 and 0.0012 at 60 and 70 epochs, respectively.\nFor tieredImageNet, the initial learning rate is set to 0.1 with a decay factor 0.1 at every 20 epochs.\nThe temperature hyperparameter (\u03c4 in Eq. 2) is set to 0.025, the reduction ratio in the meta-learner\nis set to 6, and the weight hyperparameter (\u03bb) in the overall loss function is set to 0.5. For the\ntransductive algorithm, the selected number of query samples in the first iteration (t) is set to 35, and\nthe number of iterations and enlarging factor of candidate set are both set to 2. All hyperparameters\nare cross-validated in the validation sets and fixed afterwards in all experiments.\n\n1 The code and models are available at https://github.com/blue-blue272/fewshot-CAN\n\nTable 1. Comparison to the state of the art with 95% confidence intervals on 5-way classification on miniImageNet\nand tieredImageNet datasets.\nIT: Inference Time per query data in a 5-way 1-shot task on one NVIDIA\n1080Ti GPU. CAN+T denotes CAN with transductive inference. 
The methods are separated into four groups:\noptimization-based (O), parameter-generating (P), metric-learning (M) and transductive methods (T).\n\nGroup | Model | Embedding | IT (s) | miniImageNet 1-shot | miniImageNet 5-shot | tieredImageNet 1-shot | tieredImageNet 5-shot\nO | MAML [7] | ConvNet | 0.103 | 48.70 \u00b1 0.84 | 55.31 \u00b1 0.73 | 51.67 \u00b1 1.81 | 70.30 \u00b1 1.75\nO | MTL [36] | ResNet-12 | 2.020 | 61.20 \u00b1 1.80 | 75.50 \u00b1 0.80 | - | -\nO | LEO [33] | WRN-28 | - | 61.76 \u00b1 0.08 | 77.59 \u00b1 0.12 | 66.33 \u00b1 0.05 | 81.44 \u00b1 0.09\nO | MetaOpt [14] | ResNet-12 | 0.096 | 62.64 \u00b1 0.62 | 78.63 \u00b1 0.46 | 65.99 \u00b1 0.72 | 81.56 \u00b1 0.53\nP | MetaNet [20] | ConvNet | - | 49.21 \u00b1 0.96 | - | - | -\nP | MM-Net [3] | ConvNet | - | 53.37 \u00b1 0.48 | 66.97 \u00b1 0.35 | - | -\nP | adaNet [21] | ResNet-12 | 1.371 | 56.88 \u00b1 0.62 | 71.94 \u00b1 0.57 | - | -\nM | MN [40] | ConvNet | 0.021 | 43.44 \u00b1 0.77 | 60.60 \u00b1 0.71 | - | -\nM | PN [35] | ConvNet | 0.018 | 49.42 \u00b1 0.78 | 68.20 \u00b1 0.66 | 53.31 \u00b1 0.89 | 72.69 \u00b1 0.74\nM | RN [37] | ConvNet | 0.033 | 50.44 \u00b1 0.82 | 65.32 \u00b1 0.70 | 54.48 \u00b1 0.93 | 71.32 \u00b1 0.78\nM | DN4 [15] | ConvNet | 0.049 | 51.24 \u00b1 0.74 | 71.02 \u00b1 0.64 | - | -\nM | TADAM [26] | ResNet-12 | 0.079 | 58.50 \u00b1 0.30 | 76.70 \u00b1 0.30 | - | -\nM | Our CAN | ResNet-12 | 0.044 | 63.85 \u00b1 0.48 | 79.44 \u00b1 0.34 | 69.89 \u00b1 0.51 | 84.23 \u00b1 0.37\nT | TPN [17] | ResNet-12 | - | 59.46 | 75.65 | - | -\nT | Our CAN+T | ResNet-12 | - | 67.19 \u00b1 0.55 | 80.64 \u00b1 0.35 | 73.21 \u00b1 0.58 | 84.93 \u00b1 0.38\n\n5.2 Comparison with the State of the Art\n\nTab. 1 compares our method with existing few-shot methods on miniImageNet and tieredImageNet.\nThe comparative methods are categorized into four groups, i.e., optimization-based methods (O),\nparameter-generating methods (P), metric-learning methods (M), and transductive methods (T). Our\nmethod outperforms the optimization-based methods [7, 18, 33, 36]. 
It is noted that the optimization-based\nmethods need fine-tuning on the target task, making the classification time consuming. In\ncontrast, our method requires no model updating and solves the tasks in a feed-forward manner, which\nis much faster and simpler than the above methods and has better results.\nOur method performs better than the parameter-generating methods [20, 21, 3], with an improvement\nup to 7%. These approaches generate the parameters of the feature extractor based on the support set\nand extract the query features adaptively. However, these methods suffer from the high dimensionality\nof the parameter space. Instead, our method uses a cross attention module to adaptively extract the\nsupport and query features, which is computationally lightweight and achieves a better performance.\nOur method belongs to the metric-learning methods. Existing metric-based methods [40, 35, 37,\n26, 15] extract features of support and query samples independently, making the features attend to\nnon-target objects. Instead, our CAN highlights the target object regions and gets more discriminative\nfeatures. Compared to TADAM [26], CAN with almost the same number of parameters achieves 5%\nhigher performance on 1-shot, which demonstrates the superiority of our cross attention module.\nIn the transductive setting, CAN with transduction (CAN+T) outperforms the prior work TPN [17]\nby a large margin, up to 8% and 5% improvements on 1-shot and 5-shot respectively. TPN uses a\ngraph network to propagate the labels of the support set to the query set. In contrast, our algorithm\nselects the top confident query samples to augment the support set, which can explicitly alleviate the\nlow-data problem. In addition, our transductive algorithm can be easily applied to other few-shot\nlearning models, e.g., matching network [40], prototypical network [35] and relation network [37].\nTime complexity comparison. Tab. 
1 further compares the time cost of our method to others. Some\nmethods [40, 35, 37, 15, 7] use a 4-layer ConvNet as the backbone and thus have a relatively low time\ncost. Even so, our CAN is comparable or even superior to these methods in terms of time\ncost, with a performance improvement up to 10%. The others use the same backbone as CAN, but\nrequire follow-up modules such as model update per task [36, 14], gradient-based parameter\ngeneration [21], or expensive condition generation [26], which all incur more time overhead than\nCAM. Overall, Tab. 1 shows that CAN outperforms other methods without excessive overhead.\n\n5.3 Ablation Study\n\nIn this subsection, we empirically show the effectiveness of each component of CAN. We first\nintroduce two baselines to be used for comparison. In R12-proto [35], the features from the embedding\nmodule are directly fed to the nearest neighbor classifier and the model is trained with the nearest neighbor\nclassification loss. In R12-proto-ac, the only difference from R12-proto is that R12-proto-ac has\nan additional logit head for global classification (the normal 64-way classification in the miniImageNet\ncase) and the model is trained with a joint global and nearest neighbor classification loss.\n\nTable 2. Ablation study on miniImageNet and complexity comparisons. PN: Parameter Number; GFLOPs: the\nnumber of floating-point operations; CIT: CPU Inference Time of a task with 15 query samples per class.\n\nDescription | PN | 5-way 1-shot GFLOPs | 5-way 1-shot CIT | 5-way 1-shot accuracy | 5-way 5-shot GFLOPs | 5-way 5-shot CIT | 5-way 5-shot accuracy\nR12-proto | 8.04M | 101.550 | 0.96s | 55.46 | 126.938 | 1.25s | 69.00\nR12-proto-ac | 8.04M | 101.550 | 0.97s | 61.30 | 126.938 | 1.26s | 76.70\nCAN-NoML-1 | 8.04M | 101.812 | 1.01s | 63.55 | 127.201 | 1.29s | 78.88\nCAN-NoML-2 | 8.04M | 101.812 | 1.01s | 63.38 | 127.201 | 1.30s | 79.08\nCAN | 8.04M | 101.813 | 1.02s | 63.85 | 127.203 | 1.31s | 79.44\nCAN+T | 8.04M | 101.930 | 1.11s | 67.19 | 127.320 | 1.43s | 80.64\n\nTable 3. The proposed transductive algorithm for other few-shot learning models. * indicates our re-implemented\nresults using the code provided by LwoF [8]. Values in parentheses are the results reported in the original papers.\n\nModels | Inductive 1-shot | Inductive 5-shot | Transductive 1-shot | Transductive 5-shot\nMatching Network [40] | 53.52* (43.77) | 66.20* (60.60) | 56.31 | 69.80\nPrototypical Network [35] | 53.68* (49.42) | 70.44* (68.20) | 55.15 | 71.12\nRelation Network [37] | 50.65* (50.44) | 64.18* (65.32) | 52.40 | 65.36\n\nInfluence of global classification. The comparison results are shown in Tab. 2. By comparing\nR12-proto-ac to R12-proto, we can find large improvements on both 1-shot (5.8%) and 5-shot (7.7%).\nWe further try another meta-learner, matching network (MN) [40] (see footnote 2), and the proposed joint learning\nscheme improves MN from 55.29% to 59.14% on the 1-shot setting and from 67.74% to 73.81% on the 5-shot\nsetting. The consistent improvements demonstrate the effectiveness of the joint learning scheme. We\nargue that the global classification loss provides regularization on the embedding module and forces\nit to perform well on two decoupled tasks, nearest neighbor classification and global classification.\nInfluence of cross attention module. By comparing our CAN to R12-proto-ac, we observe consistent\nimprovements on both 1-shot and 5-shot scenarios. The reason is that when using the cross attention\nmodule, our model is able to highlight the relevant regions and extract more discriminative features.\nThe performance gap also provides evidence that (1) conventionally independently extracted features\ntend to focus on non-target regions and produce inaccurate similarities. 
(2) the cross attention module can\nhelp to highlight target regions and reduce such inaccuracy with small overhead.\nInfluence of meta-learner in CAM. To verify the effectiveness of the meta-learner in CAM, we\ndevelop two variants of CAM without the meta-learner. Specifically, one variant, named CAN-NoML-1,\nsets the kernel w (shown in Fig. 2 (b)) to be a fixed mean kernel, i.e., performing global average\npooling on the correlation map R to get the attention maps A. The other variant, CAN-NoML-2, sets\nthe kernel w to a vanilla learnable convolutional kernel that remains the same for all input samples.\nAs shown in Tab. 2, both variants outperform R12-proto-ac consistently, which further demonstrates\nthe effectiveness of the proposed cross attention mechanism. The improvement of CAN-NoML-1\nshows that the mean of the correlations can roughly estimate the relevant semantic information, which further\nverifies the soundness of our meta-learner design. As seen, CAN outperforms both variants. The\nimprovement can be attributed to the meta-learning scheme, which learns to adaptively generate the\nkernel w according to the input pair of feature maps.\nInfluence of transductive inference algorithm. As shown in Tab. 2, CAN+T greatly improves over CAN,\nespecially in 1-shot where the low-data problem is more serious. To further verify its effectiveness,\nwe apply it to other few-shot models, i.e., matching network [40], prototypical network [35] and\nrelation network [37]. We re-implement these models using the code provided by [8] to ensure\na fair comparison. As shown in Tab. 3, our algorithm consistently improves the performance of\nthese models, which demonstrates its generalization ability. Nevertheless, the improvements to these\nmodels are smaller than the improvement to CAN. 
We argue that CAN can predict more precise pseudo labels for query\nsamples and augment the support set more effectively, thus leading to better performance.\n\n2 We re-implement the matching network with ResNet-12 as the backbone on miniImageNet.\n\nFigure 4. Class activation mapping (Cam) visualization on a 5-way 1-shot task with 1 query sample per class.\n\nComplexity comparisons. To illustrate the cost of CAN, we report the number of parameters\n(PN), the number of floating-point operations (GFLOPs) and the average CPU inference time (CIT)\nfor 5-way 1-shot and 5-way 5-shot tasks with 15 query samples per class. As shown in Tab. 2,\nCAN introduces negligible parameters (the parameters W1 and W2 of the meta-learner in CAM)\nand small computational overhead. For example, CAN requires 101.81 GFLOPs for 5-way 1-shot,\ncorresponding to only a 0.25% relative increase over the original R12-proto-ac. Notably, the correlation\nmap in CAM can be computed with one matrix multiplication, which takes little time in GPU\nlibraries. The transductive inference algorithm also introduces small computational overhead (0.37%\non 1-shot and 0.31% on 5-shot) since it directly utilizes the extracted embedding features to regenerate\nthe class feature and only passes the lightweight CAM again.\n\n5.4 Visualization Analysis\n\nTo qualitatively evaluate the proposed cross attention mechanism, we compare the class activation\nmaps [48] visualization results of CAN to other meta-learners, RN [37], MAML [7] and TADAM [26].\nAs shown in Fig. 4 (a), the features of RN usually contain non-target objects since it lacks an explicit\nmechanism for feature adaptation. MAML performs gradient-based adaptation, which makes the\nmodel merely learn some conspicuous discriminative features in the support images without delving\ninto the intrinsic characteristics of the target objects. As shown in Fig. 
4 (b), MAML attends to the\nship for the groenendael support image to better distinguish it from the golden retriever category,\nresulting in a confusing location and misclassification of the groenendael category. TADAM performs\ntask-dependent adaptation and applies the same adaptive parameters to all query images of a task,\nthus it is difficult to locate different target objects for different categories. As shown in Fig. 4 (c),\nTADAM mistakenly attends to the dog for the worm fence query image. In contrast, CAN processes the\nquery samples with different adaptive parameters, which allows it to focus on the different target\nobjects for different categories, as shown in Fig. 4 (d).\n\n6 Conclusion\n\nIn this paper, we proposed a cross attention network for few-shot classification. Firstly, a cross\nattention module is designed to model the semantic relevance between class and query features. It\ncan adaptively localize the relevant regions and generate more discriminative features. Secondly, we\npropose a transductive inference algorithm to alleviate the low-data problem. It utilizes the unlabeled\nquery samples to enrich the class features to be more representative. Extensive experiments show\nthat our method is far simpler and more efficient than recent few-shot meta-learning approaches, and\nproduces state-of-the-art results.\nAcknowledgement. This work is partially supported by the National Key R&D Program of China\n(No.2017YFA0700800), Natural Science Foundation of China (NSFC): 61876171 and Beijing\nNatural Science Foundation under Grant L182054.\n\n[Figure 4 panels: Support and Query rows; (a) Cam of RN, (b) Cam of MAML, (c) Cam of TADAM, (d) Cam of CAN; classes: golden retriever, worm fence, groenendael, park bench, seashore.]\n\nReferences\n\n[1] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. de Freitas. Learning to learn by gradient descent by gradient descent. In NeurIPS, 2016.\n\n[2] L. Bertinetto, J. F. Henriques, J. Valmadre, P. Torr, and A. Vedaldi. 
Learning feed-forward one-shot learners. In NeurIPS, pages 523\u2013531, 2016.\n\n[3] Q. Cai, Y. Pan, T. Yao, C. Yan, and T. Mei. Memory matching networks for one-shot image recognition. In CVPR, pages 4080\u20134088, 2018.\n\n[4] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T. S. Chua. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In CVPR, pages 5659\u20135667, 2017.\n\n[5] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Sch\u00f6lkopf. Learning with local and global consistency. In NeurIPS, pages 321\u2013328, 2004.\n\n[6] L. Dong-Hyun. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML workshop, 2013.\n\n[7] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pages 1126\u20131135, 2017.\n\n[8] S. Gidaris and N. Komodakis. Dynamic few-shot visual learning without forgetting. In CVPR, 2018.\n\n[9] S. Gidaris and N. Komodakis. Generating classification weights with gnn denoising autoencoders for few-shot learning. In CVPR, 2019.\n\n[10] R. Hou, B. Ma, H. Chang, X. Gu, S. Shan, and X. Chen. Interaction-and-aggregation network for person re-identification. In CVPR, pages 9317\u20139326, 2019.\n\n[11] R. Hou, B. Ma, H. Chang, X. Gu, S. Shan, and X. Chen. Vrstc: Occlusion-free video person re-identification. In CVPR, pages 7183\u20137192, 2019.\n\n[12] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.\n\n[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, pages 1097\u20131105, 2012.\n\n[14] K. Lee, S. Maji, A. Ravichandran, and S. Soatto. Meta-learning with differentiable convex optimization. In CVPR, 2019.\n\n[15] W. Li, L. Wang, J. Xu, J. Huo, Y. Gao, and J. Luo. 
Revisiting local descriptor based image-to-class measure for few-shot learning. In CVPR, 2019.\n\n[16] Z. Li, F. Zhou, F. Chen, and H. Li. Meta-sgd: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.\n\n[17] Y. Liu, J. Lee, M. Park, S. Kim, E. Yang, S. J. Hwang, and Y. Yang. Learning to propagate labels: Transductive propagation network for few-shot learning. In ICLR, 2018.\n\n[18] Y. Liu, Q. Sun, A. A. Liu, Y. Su, B. Schiele, and T. S. Chua. Lcc: Learning to customize and combine neural networks for few-shot learning. arXiv preprint arXiv:1904.08479, 2019.\n\n[19] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive meta-learner. In ICLR, 2018.\n\n[20] T. Munkhdalai and H. Yu. Meta networks. In ICML, pages 2554\u20132563, 2017.\n\n[21] T. Munkhdalai, X. Yuan, S. Mehri, and A. Trischler. Rapid adaptation with conditionally shifted neurons. In ICML, 2018.\n\n[22] D. K. Naik and R. J. Mammone. Meta-neural networks that learn by learning. In IJCNN, pages 437\u2013442, 1992.\n\n[23] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.\n\n[24] A. Nichol, J. Achiam, and J. Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.\n\n[25] A. Oliver, A. Odena, C. A. Raffel, E. D. Cubuk, and I. J. Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In NeurIPS, 2018.\n\n[26] B. Oreshkin, P. R. Lopez, and A. Lacoste. Tadam: Task dependent adaptive metric for improved few-shot learning. In NeurIPS, pages 719\u2013729, 2018.\n\n[27] J. Park, S. Woo, J. Y. Lee, and I. S. Kweon. Bam: Bottleneck attention module. In BMVC, 2018.\n\n[28] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS workshop, 2017.\n\n[29] M. Pedersoli, T. Lucas, C. Schmid, and J. 
Verbeek. Areas of attention for image captioning. In ICCV, pages 1251\u20131259, 2017.\n\n[30] G. Peng, L. Hongsheng, L. Shuang, L. Pan, L. Yikang, H. Steven CH, and W. Xiaogang. Question-guided hybrid convolution for visual question answering. In ECCV, pages 469\u2013485, 2018.\n\n[31] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.\n\n[32] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel. Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676, 2018.\n\n[33] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell. Meta-learning with latent embedding optimization. In ICLR, 2019.\n\n[34] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In ICML, pages 1842\u20131850, 2016.\n\n[35] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In NeurIPS, pages 4077\u20134087, 2017.\n\n[36] Q. Sun, Y. Liu, T. S. Chua, and B. Schiele. Meta-transfer learning for few-shot learning. In CVPR, 2019.\n\n[37] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, pages 1199\u20131208, 2018.\n\n[38] S. Thrun. Lifelong learning algorithms. In Learning to Learn, pages 181\u2013209, 1998.\n\n[39] S. Thrun and L. Pratt. Learning to learn: Introduction and overview. In Learning to Learn, pages 3\u201317, 1998.\n\n[40] O. Vinyals, C. Blundell, T. Lillicrap, and D. Wierstra. Matching networks for one shot learning. In NeurIPS, pages 3630\u20133638, 2016.\n\n[41] S. Woo, J. Park, J. Y. Lee, and I. S. Kweon. Cbam: Convolutional block attention module. In ECCV, pages 3\u201319, 2018.\n\n[42] Y. Wu, Y. Lin, X. Dong, Y. Yan, W. Ouyang, and Y. Yang. 
Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning. In CVPR, pages 5177\u20135186, 2018.\n\n[43] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In ECCV, pages 451\u2013466, 2016.\n\n[44] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. 2015.\n\n[45] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In CVPR, pages 21\u201329, 2016.\n\n[46] D. Yu, J. Fu, T. Mei, and Y. Rui. Multi-level attention networks for visual question answering. In CVPR, pages 4709\u20134717, 2017.\n\n[47] L. Zhao, X. Li, J. Wang, and Y. Zhuang. Deeply-learned part-aligned representations for person re-identification. In ICCV, pages 3239\u20133248, 2017.\n\n[48] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, pages 2921\u20132929, 2016.\n\n[49] Z. Zhun, Z. Liang, K. Guoliang, L. Shaozi, and Y. Yi. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.", "award": [], "sourceid": 2213, "authors": [{"given_name": "Ruibing", "family_name": "Hou", "institution": "Institute of Computing Technology\uff0cChinese Academy"}, {"given_name": "Hong", "family_name": "Chang", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"given_name": "Bingpeng", "family_name": "MA", "institution": "University of Chinese Academy of Sciences"}, {"given_name": "Shiguang", "family_name": "Shan", "institution": "Chinese Academy of Sciences"}, {"given_name": "Xilin", "family_name": "Chen", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}]}