{"title": "Generalized Zero-Shot Learning with Deep Calibration Network", "book": "Advances in Neural Information Processing Systems", "page_first": 2005, "page_last": 2015, "abstract": "A technical challenge of deep learning is recognizing target classes without seen data. Zero-shot learning leverages semantic representations such as attributes or class prototypes to bridge source and target classes. Existing standard zero-shot learning methods may be prone to overfitting the seen data of source classes as they are blind to the semantic representations of target classes. In this paper, we study generalized zero-shot learning that assumes accessible to target classes for unseen data during training, and prediction on unseen data is made by searching on both source and target classes. We propose a novel Deep Calibration Network (DCN) approach towards this generalized zero-shot learning paradigm, which enables simultaneous calibration of deep networks on the confidence of source classes and uncertainty of target classes. Our approach maps visual features of images and semantic representations of class prototypes to a common embedding space such that the compatibility of seen data to both source and target classes are maximized. We show superior accuracy of our approach over the state of the art on benchmark datasets for generalized zero-shot learning, including AwA, CUB, SUN, and aPY.", "full_text": "Generalized Zero-Shot Learning with Deep\n\nCalibration Network\n\nShichen Liu\u2020, Mingsheng Long\u2020((cid:66)), Jianmin Wang\u2020, and Michael I. Jordan(cid:93)\n\u2020KLiss, MOE; BNRist; Research Center for Big Data, Tsinghua University, China\n\n\u2020School of Software, Tsinghua University, China\n\n(cid:93)University of California, Berkeley, Berkeley, USA\n\nliushichen95@gmail.com\n\n{mingsheng, jimwang}@tsinghua.edu.cn\n\njordan@berkeley.edu\n\nAbstract\n\nA technical challenge of deep learning is recognizing target classes without seen\ndata. 
Zero-shot learning leverages semantic representations such as attributes or class prototypes to bridge source and target classes. Existing standard zero-shot learning methods may be prone to overfitting the seen data of source classes as they are blind to the semantic representations of target classes. In this paper, we study generalized zero-shot learning, which assumes that the semantic representations of target classes are accessible during training while their unseen data are not, and in which prediction on unseen data is made by searching over both source and target classes. We propose a novel Deep Calibration Network (DCN) approach towards this generalized zero-shot learning paradigm, which enables simultaneous calibration of deep networks on the confidence of source classes and the uncertainty of target classes. Our approach maps visual features of images and semantic representations of class prototypes to a common embedding space such that the compatibility of seen data to both source and target classes is maximized. We show superior accuracy of our approach over the state of the art on benchmark datasets for generalized zero-shot learning, including AwA, CUB, SUN, and aPY.

1 Introduction

Remarkable advances in object recognition have been achieved in recent years with the prosperity of deep convolutional neural networks [7, 29, 46, 24]. Despite the exciting advances, most successful recognition models are based on supervised deep learning, which often requires large-scale labeled samples to learn category concepts and visual representations [44]. This dependence of deep learning on large-scale labeled data has limited well-established deep models to tasks with only thousands of classes, while recognizing objects "in the wild" without labeled training data remains a major technical challenge for artificial intelligence. 
In many real applications, objects in different categories may follow a long-tailed distribution: some popular categories have a large number of training images while other categories have few or even no training images. Furthermore, collecting and annotating a large-scale set of representative exemplar images for target categories is expensive and in many cases prohibitive [14]. In contrast to deep learning, humans can learn from few examples by effectively transferring knowledge from relevant categories, and can even recognize unseen objects [19]. This capability of humans has motivated active research in one-shot and zero-shot learning [8]. It is thus imperative to design versatile algorithms for zero-shot learning, which extends classifiers from source categories, for which labeled images are available during training, to target categories, for which labeled images are not accessible [32, 37]. Zero-shot learning (ZSL) has attracted wide attention in various research areas including face verification [30], object recognition [32], and video understanding [52]. The main idea of zero-shot learning is to recognize objects of target classes by

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Technical difficulty for generalized zero-shot learning. Filled circles denote the semantic representations of all classes, empty markers denote seen data of source classes, and unseen data of target classes are unavailable. Given each training image, e.g. from a seen class dog denoted by the orange triangle, we can classify it to either source or target classes (the probability bars). 
However, overfitting the seen data to source classes will lead to uncertain predictions over target classes for both seen and unseen data, which potentially hurts the generalized zero-shot learning performance.

transferring knowledge from source classes through semantic representations of source and target classes, typically in the form of attributes [32] and class prototypes [8]. Towards this goal, we need to address two technical problems, as envisioned in [8]: (i) how to relate target classes to source classes for knowledge transfer, and (ii) how to make predictions on target classes without labeled training data. Towards the first problem, visual attributes [12, 32, 38] and word embeddings [15, 35, 47] have been explored as semantic representations to correlate source and target classes. Most works leverage class embeddings directly as bridges between input images and output classes [1, 15, 36, 32, 42, 19], while others directly learn representations from class embeddings [17, 27, 60]. Towards the second problem, probabilistic models [32] have been strong baselines for zero-shot learning. Classifiers for the target classes can be trained directly in the input feature space [1, 11, 34, 60] or in the semantic space using nearest prototype classifiers [15, 36, 17, 20, 19]. A recent method of synthesized classifiers (SynC) [8] exploited the existence of clustering structures in the semantic embedding space and constrained the two aligned manifolds of clusters corresponding to the semantic embeddings and the centers in the visual feature space. The work [9] further imposed the structural constraint that semantic representations must be predictive of the locations of their corresponding visual exemplars. 
These two representative approaches yielded state-of-the-art performance on ZSL.

Despite the encouraging advances in standard zero-shot learning, where only source classes with seen data are available during training, the basic assumption that unseen data come only from target classes is not realistic for real applications, since unseen data may come naturally from both source and target classes. As a consequence, the generalized zero-shot learning setting, where prediction on unseen data is made over both source and target classes, was first proposed in [10] and has drawn increasing attention [45, 54]. Unfortunately, existing standard zero-shot learning methods may be prone to overfitting the seen data of source classes as they are blind to the semantic representations of target classes. Figure 1 intuitively demonstrates this technical difficulty: as the seen data are overfit to the source classes, they are pushed far away from the target classes, and the resulting model tends to make uncertain predictions for the unseen data over both source classes and target classes. Hence, how to extend deep networks to the generalized zero-shot learning setting remains an open problem.

Towards the above technical difficulty of generalized zero-shot learning, we propose a novel Deep Calibration Network (DCN) approach that enables simultaneous calibration of deep networks on the confidence of source classes and the uncertainty of target classes. The uncertainty of target classes is the main obstacle to generalized zero-shot learning, and it is calibrated by the entropy minimization principle. The overconfidence of source classes is a side-effect of high-capacity deep networks, which can be calibrated by the temperature distillation method [23]. 
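As a quick, self-contained illustration of the temperature idea (a NumPy sketch under our own toy logits, not the paper's implementation): raising the temperature τ of a softmax increases the entropy of the output distribution without changing the predicted class, while lowering it sharpens the distribution.

```python
import numpy as np

def softmax_with_temperature(logits, tau=1.0):
    """Temperature-scaled softmax: larger tau -> softer (higher-entropy) output."""
    z = np.asarray(logits, dtype=float) / tau
    z -= z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy of a discrete distribution."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

logits = np.array([3.0, 1.0, 0.2])      # illustrative scores for 3 classes
sharp = softmax_with_temperature(logits, tau=0.5)   # sharper than tau = 1
soft = softmax_with_temperature(logits, tau=5.0)    # softer than tau = 1
# tau rescales confidence but leaves the argmax (the predicted class) unchanged
assert sharp.argmax() == soft.argmax()
assert entropy(soft) > entropy(sharp)
```

This is the same mechanism the paper applies during training: τ controls how peaked the output probabilities are, which is what calibration manipulates.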
We propose an end-to-end deep architecture comprised of deep convolutional neural networks (CNN) and multilayer perceptrons (MLP) with new loss functions that enable end-to-end training by back-propagation, with which we map visual features of images and semantic representations of class prototypes to a common embedding space such that the compatibility of seen data to the target classes is maximized. We show superior accuracy of our approach over the state of the art on benchmark datasets (AwA, CUB, SUN, and aPY) for both generalized and standard zero-shot learning under rigorous evaluation protocols [54].

2 Related Work

Zero-shot learning models recognize target classes with no training samples by transferring knowledge from source classes with abundant training samples. Zero-shot learning was first studied by exploring visual attributes as semantic descriptions [12, 1, 16, 17, 30, 38, 32, 51, 4, 56]. This, however, requires manually annotated attribute prototypes for each class, which is very costly for large-scale datasets. Recently, semantic word embeddings [35] have been proposed to embed each class word by unsupervised learning from large-scale general text corpora, eliminating the need for human annotation of many visual attributes [47, 3, 15, 20, 36]. A combination of several embedding methods has also been investigated to improve the representation power for diverse category labels [3, 17, 20]. Based on semantic embeddings of class prototypes, existing zero-shot learning work can be generally categorized into embedding-based and similarity-based methods. 
In the embedding-based methods, one first maps visual features to the semantic space, and then predicts the class labels by various similarity measures implied by the class embeddings [1, 3, 15, 17, 20, 27, 32, 33, 36, 47, 51]. Some recent work combines these two stages for more accurate predictions [1, 3, 15, 42, 51, 59, 60]. In addition to the fixed semantic embeddings, some work maps them into a different space through subspace learning or feature learning [17, 27, 59, 61]. In the similarity-based methods, classifiers are built for target classes by relating them to the source classes through class-wise similarities [11, 20, 21, 34, 40, 41].

Towards the generalized zero-shot learning (GZSL) paradigm, Chao et al. [10] first studied this problem and tackled it with a calibrated stacking method that sets different prediction thresholds for seen and unseen classes. Unlike setting thresholds for predictions, our method learns a unified model by calibrating the uncertainty. Xian et al. [53] and Verma et al. [31] explored approaches that synthesize images for unseen classes using GANs [22] or VAEs [26]. Despite their strong performance, these generative models are more difficult to train. Our approach significantly boosts the accuracy of GZSL in the regime of discriminative models, which retains the simplicity of model design. Our work is complementary to the generative-model-based approaches. Another related setting is transductive zero-shot learning [18, 49], which exploits unlabeled target data at training time. While our approach does not require such data, it requires the availability of the target classes during training.

3 Generalized Zero-Shot Learning

In zero-shot learning, we are given seen data D = {(x_n, y_n)}_{n=1}^N of N labeled points with labels from the source classes S = {1, . . . , S}, where x_n ∈ R^P is the feature of the n-th image and y_n ∈ S is the label. Denote by T = {S + 1, . . . , S + T} the target classes, where no seen data are available in the training phase. For each class c ∈ S ∪ T , denote by a_c ∈ R^Q its semantic representation (attributes or word embeddings), and by A = {a_c}_{c=1}^{S+T} the set of semantic representations. In the test phase, we predict unseen data D' = {x_m}_{m=N+1}^{N+M} of M points from either source or target classes.

Definition 1 (Zero-Shot Learning, ZSL). Given D and {a_c}_{c=1}^S, classify D' over the target classes T .

Definition 2 (Generalized Zero-Shot Learning, GZSL). [10] Given D and {a_c}_{c=1}^{S+T} of both source and target classes, learn a model f : x ↦ y to classify D' over both source and target classes S ∪ T .

At first sight, it seems that the assumption of available semantic representations for the target classes {a_c}_{c=S+1}^{S+T} is too strong. But note that in generalized zero-shot learning we have to make predictions on unseen data over both source and target classes, and if we do not have access to the target classes, this generalized setting is highly underdetermined and cannot be solved from a statistical perspective. Since we still hold the essential assumption of zero-shot learning that unseen data are not available in the training phase, and since the semantic representations of target classes are not expensive to obtain, we believe this assumption is the mildest one we could have for real zero-shot learning applications. Figure 2 shows the proposed Deep Calibration Network (DCN) for generalized zero-shot learning.

Figure 2: Deep Calibration Network (DCN) for generalized zero-shot learning, comprised of four modules: (i) a CNN for learning deep embedding φ(x) for each image x and an MLP for learning deep embedding ψ(a) for each class a; (ii) a prediction function f made by the nearest prototype classifier (NPC); (iii) two probabilities p and q that transform the prediction function f into distributions over the 
source and target classes; (iv) a cross-entropy loss that minimizes the overconfidence of source classes, and an entropy loss that minimizes the uncertainty of target classes. Best viewed in color.

3.1 Prediction Function

The architecture of DCN consists of a visual model and a text model. We adopt deep convolutional networks as our visual models, e.g. GoogLeNet [48] and ResNet [24]. Deep convolutional networks can represent each image x ∈ D by a feature embedding φ(x) ∈ R^K, where K is the dimension of the embedding space. We adopt multilayer perceptrons (MLP) [43] as our text model, which can learn a deep representation ψ(a) ∈ R^K for each class a ∈ A in the same K-dimensional embedding space. Each semantic class a ∈ A can be represented either by word embeddings generated by Word2Vec [35] or by visual attributes annotated by humans to describe the visual patterns [32]. Note that both images x ∈ D and semantics a ∈ A are mapped to the common K-dimensional embedding space as

z_n = φ(x_n),  v_c = ψ(a_c),   (1)

where z_n ∈ R^K and v_c ∈ R^K are the visual embeddings of images and the semantic embeddings of classes. Both networks utilize the nonlinearity tanh, which squashes the predicted values within [−1, 1].

Classification can be made in the shared embedding space by the nearest prototype classifier (NPC), which assigns to an image x_n the class label c ∈ S ∪ T whose semantic embedding ψ(a_c) is closest in similarity to the visual embedding φ(x_n). Denote by f the prediction function of the NPC classifier,

f_c(x_n) = sim(φ(x_n), ψ(a_c)),   (2)

where sim(·, ·) is a similarity function, e.g. inner product or cosine similarity; f_c(x_n) is the strength with which the NPC classifier assigns image x_n to class c. The tanh activation function further strengthens the nonlinearity of the nearest prototype classifier. The predicted class y(x_n) of image x_n is given by

y(x_n) = arg max_c f_c(x_n),   (3)

where the prediction is made over both source and target classes c ∈ S ∪ T in generalized zero-shot learning. Note that making predictions for different classes leads to different technical difficulties [54].

3.2 Risk Minimization

With the prediction f(x_n) and class labels y_n in the seen data D, one can perform classification by applying any well-established loss function. We take the multi-class Hinge loss as an example:

Σ_{n=1}^N Σ_{c=1}^S max(0, Δ(y_n, c) + f_c(x_n) − f_{y_n}(x_n)),   (4)

where the margin is defined as Δ(y_n, c) = 0 if y_n = c and 1 otherwise. Most zero-shot learning methods (based on DeViSE [15]) use the multi-class Hinge loss to learn visual-semantic embeddings. However, a very recent study [23] reveals that deep neural networks are no longer well-calibrated in their prediction confidence: the probability given by the Softmax outputs tends to be absolutely confident, i.e. the maximum value of each Softmax output is close to 1, much higher than the real confidence. Despite the higher accuracy of high-capacity deep models on many tasks, overfitting to the seen data of source classes may hurt transfer to target classes for generalized zero-shot learning. Towards the above miscalibration of deep networks, we apply temperature calibration to mitigate the overconfidence in the source classes caused by overfitting the seen data. Temperature calibration was originally proposed by Hinton et al. [25] to distill knowledge from deep networks. 
We apply temperature calibration to transform the prediction f into a probability distribution over the source classes as

p_c(x_n) = exp(f_c(x_n)/τ) / Σ_{c'=1}^S exp(f_{c'}(x_n)/τ),   (5)

where τ is the temperature, and τ = 1 is the common option in deep networks. A temperature τ > 1 "softens" the softmax (raises the output entropy). As τ → ∞, the probability p_c → 1/S, which leads to maximum uncertainty. As τ → 0, the probability collapses to a point mass (i.e. p_c = 1). Since τ does not change the maximum of the softmax function, the class prediction remains unchanged if temperature calibration τ ≠ 1 is applied after convergence. However, note that in this paper the temperature calibration τ ≠ 1 is applied over the nearest prototype classifier (2) to encourage better-calibrated probabilities, which prevents deep models from generating overconfident predictions. Unlike the distillation strategy [25, 23], we apply the temperature calibration τ ≠ 1 during training. Such a calibrated probability is essentially useful when we calibrate the uncertainty of target classes. Plugging the probability p_c (5) into the cross-entropy loss over seen data D from source classes S yields

L = − Σ_{n=1}^N Σ_{c=1}^S y_{n,c} log p_c(x_n).   (6)

It is worth noting that, compared with the family of methods based on the multi-class Hinge loss in Eq. (4), this work differs in adopting the cross-entropy objective, which is a native solution to multi-class problems and can take the power of temperature calibration to mitigate overfitting.

3.3 Uncertainty Calibration

As common practice in standard zero-shot learning, we can apply the trained model f_c in Eq. (3) to classify the unseen data over only the target classes T . 
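Before turning to the generalized setting, the source-class pipeline of Eqs. (2), (5), and (6) can be sketched in a few lines of NumPy (a minimal sketch assuming inner-product similarity and one-hot labels; the array names and random embeddings are illustrative, not the authors' code):

```python
import numpy as np

def source_probabilities(z, V_src, tau=0.3):
    """Eq. (5): temperature-calibrated softmax of NPC scores over source classes.
    z: (K,) visual embedding phi(x); V_src: (S, K) source-class embeddings psi(a_c)."""
    f = V_src @ z                 # Eq. (2) with inner-product similarity f_c(x)
    logits = f / tau              # temperature calibration applied during training
    logits -= logits.max()        # numerical stability; the argmax is unchanged
    e = np.exp(logits)
    return e / e.sum()

def cross_entropy(z, y, V_src, tau=0.3):
    """Eq. (6) for one sample: negative log-probability of the true source class y."""
    p = source_probabilities(z, V_src, tau)
    return float(-np.log(p[y]))

# toy example with hypothetical tanh-squashed embeddings (cf. Eq. (1))
rng = np.random.default_rng(0)
V_src = np.tanh(rng.normal(size=(5, 8)))   # 5 source-class embeddings in [-1, 1]
z = np.tanh(rng.normal(size=8))            # one visual embedding
loss = cross_entropy(z, y=2, V_src=V_src)
assert loss > 0.0
```

Summing `cross_entropy` over all seen samples gives the loss L of Eq. (6); in the paper this is minimized end-to-end through the CNN and MLP parameters.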
However, in generalized zero-shot learning, things become more difficult as we have to make predictions over both source and target classes S ∪ T . Due to the huge gap between the disjoint source and target classes, a deep network trained on the source classes may still be under-confident for the target classes. In principle, this generalized zero-shot learning paradigm would be impossible without exploiting any structures from the target classes. In this paper, we enable generalized zero-shot learning by making our model unblind to target classes. It is important to note that the data associated with target classes are still inaccessible during training, so as to comply with the most important assumption of zero-shot learning. To remove the blindness, we first transform the prediction f_c into a probability (with temperature calibration) over the target classes as

q_c(x_n) = exp(f_c(x_n)/τ) / Σ_{c'=S+1}^{S+T} exp(f_{c'}(x_n)/τ).   (7)

Note that the temperature calibration τ ≠ 1 is applied during the end-to-end training of both Eq. (6) and Eq. (7). This strategy is enabled in the stochastic back-propagation procedure of deep networks. Intuitively, the probability q_c should assign each seen image to the target classes as certainly as possible, namely, classify it to the target classes that are most similar to the image's label in the source classes, rather than classify it to all target classes with equal uncertainty. 
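The target-class probability of Eq. (7), and the per-sample entropy that quantifies how uncertain that distribution is, can be sketched as follows (a NumPy sketch with illustrative names and inner-product similarity; not the authors' code):

```python
import numpy as np

def target_probabilities(z, V_tgt, tau=0.3):
    """Eq. (7): temperature-calibrated softmax of NPC scores over target classes only.
    z: (K,) visual embedding; V_tgt: (T, K) target-class embeddings."""
    logits = (V_tgt @ z) / tau
    logits -= logits.max()        # numerical stability
    e = np.exp(logits)
    return e / e.sum()

def target_entropy(z, V_tgt, tau=0.3):
    """Entropy of q for one sample; summing over all seen data gives the objective H."""
    q = target_probabilities(z, V_tgt, tau)
    return float(-(q * np.log(q + 1e-12)).sum())

# toy example with hypothetical tanh-squashed embeddings
rng = np.random.default_rng(1)
V_tgt = np.tanh(rng.normal(size=(4, 8)))   # 4 target-class embeddings
z = np.tanh(rng.normal(size=8))
h = target_entropy(z, V_tgt)
assert -1e-9 <= h <= np.log(4) + 1e-9      # entropy of a 4-way distribution
```

Driving this entropy down pushes each seen image toward its most compatible target classes, which is the uncertainty calibration the paper proposes.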
In other words, we should minimize the uncertainty of classifying the seen data to target classes, such that source classes and target classes are both made compatible with the seen data and thus comparable for generalized zero-shot learning. In information theory, the entropy h(q) = −q log q is the basic measure of the uncertainty of a distribution q. In this paper, we propose the objective for uncertainty calibration based on the entropy criterion as

H = − Σ_{n=1}^N Σ_{c=S+1}^{S+T} q_c(x_n) log q_c(x_n).   (8)

Note that making predictions of seen data over target classes as certain as possible does not imply making target classes the predictions of seen data. Hence, the uncertainty calibration strategy significantly improves prediction over target classes while doing little harm to the classification of seen data.

3.4 Deep Calibration Network

The optimization problem of the deep calibration network (DCN) for generalized zero-shot learning can be formulated by integrating the empirical risk minimization (6) and uncertainty calibration (8):

min_{φ,ψ} L + λH + γΩ(φ, ψ),   (9)

where Ω(φ, ψ) is the penalty that controls model complexity, and λ and γ are hyper-parameters. In deep learning, weight decay can be used in place of the penalty term γΩ(φ, ψ), thus we need not explicitly deal with this term. After model convergence, we can apply the prediction function f_c in Eq. (3) to make predictions on the unseen data over both source and target classes. The network parameters {φ, ψ} can be efficiently optimized by standard back-propagation with the auto-differentiation technique supported in PyTorch1. It is worth noting that, by changing the deep architectures to other shallow methods (e.g. 
logistic regression), our approach can be readily applied to existing (generalized) zero-shot learning methods, provided that the outputs are probability distributions [5, 6].

4 Experiments

We perform extensive evaluation against state-of-the-art methods for zero-shot and generalized zero-shot learning on four benchmark datasets, which validates the efficacy of the proposed DCN approach.

4.1 Setup

Datasets The statistics of the four benchmark datasets for zero-shot learning are shown in Table 1. Animals with Attributes (AwA) [32] is a widely-used dataset for coarse-grained zero-shot learning, containing 30,475 images from 50 different animal classes with at least 92 labeled examples per class. A standard split into 40 source classes and 10 target classes is provided by the dataset creators [32]. Caltech-UCSD-Birds-200-2011 (CUB) [50] is a fine-grained dataset with a large number of classes and attributes, containing 11,788 images from 200 different types of birds annotated with 312 attributes. The first zero-shot split of CUB, with 150 source classes and 50 target classes, was introduced in [2]. SUN Attribute (SUN) [39] is a fine-grained dataset, medium-scale in the number of images, containing 14,340 images from 717 types of scenes annotated with 102 attributes. We adopt the standard split of [32], containing 645 source classes (of which 65 classes are used for validation) and 72 target classes. Attribute Pascal and Yahoo (aPY) [13] is a small-scale dataset with 64 attributes and 32 classes. We
We\nfollow split in [54] and use 20 Pascal classes as source classes and 12 Yahoo classes as target classes.\n\nDataset\nSUN\nCUB\nAwA\naPY\n\n# Attributes\n\n102\n312\n85\n64\n\n# Source Classes\n\n645\n150\n40\n20\n\n# Target Classes\n\n72\n50\n10\n12\n\n# Images\n14,340\n11,788\n30,475\n18,627\n\nTable 1: Statistics of the four zero-shot learning datasets.\n\nImage features Due to variations in image features used by different zero-shot learning methods,\nwe opt to a fair comparison with state of the art methods methods based on widely-used features:\n2048-dimensional ResNet-101 features provided by [54] and 1024-dimensional GoogLeNet features\nprovided by [8]. Classi\ufb01cation accuracies of existing methods are directly reported from their papers.\n\nClass embeddings Class embeddings are important for zero-shot learning. As class embeddings\nfor aPY, AwA, CUB and SUN, we use the per-class continuous attributes provided with the datasets.\nNote that the proposed method can also use the Word2Vec representations [35] as class embeddings.\n\n1https://pytorch.org\n\n6\n\n\fComparison methods We choose to compare with many competitive or representative methods,\nincluding shallow methods DAP [32], ALE [1], SJE [3], ESZSL [42], ConSE [36], SynC [8],\nEXEM [9], ZSKL [57], and deep methods DeViSE [15], CMT [47], MTMDL [55], DEM [58].\n\nModel variants We further study two variants of the proposed DCN approach to justify the ef\ufb01cacy\nof the entropy and temperature calibration strategies respectively: (i) DCN w/o E is the variant of\nDCN without entropy calibration, i.e. setting \u03bb = 0 in Equation (9). (ii) DCN w/o ET is the variant\nof DCN without entropy and temperature calibrations, i.e. 
setting λ = 0 and τ = 1 in Equation (9).

Protocols Due to variations in the evaluation protocols for different methods, we conduct extensive experiments based on two typical protocols: the Standard Protocol [32] and the Rigorous Protocol2 [54]. For benchmark datasets and standard splits, we follow exactly the Standard Protocol [32, 8] for the AwA, CUB, and SUN datasets using GoogLeNet features as well as the Standard Splits (SS), which enables a direct comparison with the published results. For CUB and SUN, we average the results of the 4 and 10 splits provided by [8], respectively. We report the average per-image classification accuracy based on five random experiments, where the accuracy is computed over the images in target classes. For deep methods, we use GoogLeNet-V2 as the base network on the standard splits of AwA and CUB [58].

A recent study [54] has shown that the Standard Protocol is not rigorous enough to benchmark zero-shot learning. A new Rigorous Protocol is proposed for three reasons: (a) the image features are ImageNet-pretrained ResNet-101 features, which have higher accuracy than GoogLeNet features, yielding a more stable comparison across different methods; (b) the Proposed Split (PS) guarantees that no target classes are from ImageNet-1K, since it is used to pre-train the base network; otherwise unfairness would be introduced; (c) the zero-shot performance is evaluated by the per-class (instead of per-image) classification accuracy of Eq. (10), which accounts for the imbalance of the target classes. We thus adopt this Rigorous Protocol [54] for a fair comparison on the AwA, CUB, SUN, and aPY datasets.

ACC_C = (1/|C|) Σ_{c∈C} (#correctly predicted samples in class c / #samples in class c).   (10)

At the test phase of zero-shot learning, test images are restricted to target classes and the search space is restricted to the target classes C = T . 
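The per-class accuracy of Eq. (10) can be computed as follows (a minimal sketch with an illustrative toy example; function and variable names are our own):

```python
from collections import defaultdict

def per_class_accuracy(y_true, y_pred):
    """Eq. (10): mean of per-class accuracies, robust to class imbalance."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    # average the accuracy of each class present in y_true
    return sum(correct[c] / total[c] for c in total) / len(total)

# imbalanced toy example: class 0 has 4 samples, class 1 has only 1
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
# per-image accuracy would be 4/5 = 0.8; per-class is (1.0 + 0.0) / 2 = 0.5
assert per_class_accuracy(y_true, y_pred) == 0.5
```

The toy example shows why the Rigorous Protocol prefers this metric: a classifier that ignores rare classes is penalized, whereas per-image accuracy would reward it.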
At the test phase of generalized zero-shot learning, the search space includes both source and target classes, C = S ∪ T , a more practical but challenging setting. We compute two accuracies: ACC_ts, the accuracy on all unseen images in target classes, and ACC_tr, the accuracy on held-out seen images from source classes that are not used for training, as provided by the Proposed Split in the Rigorous Protocol [54]. Then we compute the harmonic mean of the two accuracies as

ACC_H = 2 · ACC_ts · ACC_tr / (ACC_ts + ACC_tr).   (11)

We choose the harmonic mean as the final criterion to favor high accuracies on both source and target classes.

Implementation details Our end-to-end trainable approach is implemented using PyTorch. We use a single-layer multilayer perceptron (MLP) to transform the attributes to the common embedding space without changing their dimensions; we add an FC layer on top of the CNNs (GoogLeNet-v2 and ResNet-101) to transform image representations to the same dimensions as the attributes. We train the FC layers and fine-tune the CNNs end-to-end by optimizing (9). We use stochastic gradient descent with 0.9 momentum and a mini-batch size of 64. We cross-validate the learning rate in [10^-4, 10^-1], the temperature τ ∈ [0.1, 10], and the entropy-penalty parameter λ ∈ [10^-3, 10^-1]. To make the comparison more readable and direct, we first discuss the results of standard zero-shot learning.

4.2 Results of Standard Zero-Shot Learning

Standard Splits (SS) The results on the Standard Splits [32] of the three datasets are reported in Table 2. Sha et al. [8, 9] provided very comprehensive results on typical ZSL methods under rigorous evaluation protocols; we thus adopted their published results for direct comparison. We highlight the following results. (i) DCN significantly outperforms both shallow and deep zero-shot learning methods. 
This justifies the significance of the proposed cross-entropy loss (6) over the temperature-calibrated Softmax probability for training deep networks for zero-shot learning. (ii) DCN further improves the accuracy over DCN w/o ET, the variant of DCN without using temperature calibration. This validates the effectiveness of temperature calibration for mitigating the overfitting of deep networks to the seen data of source classes and thus improving the zero-shot generalization capability.

2 http://www.mpi-inf.mpg.de/zsl-benchmark

Type      Method         AwA    CUB    SUN    Avg
Shallow   DAP [32]       60.5   39.1   44.5   48.0
          ALE [1]        53.8   40.8   53.8   49.5
          SJE [3]        66.3   46.5   56.1   56.3
          ESZSL [42]     59.6   44.0    8.7   37.4
          ConSE [36]     63.3   36.2   51.9   50.5
          SynC [8]       72.9   54.5   62.7   63.4
          EXEM [9]       76.5   58.5   67.3   67.4
Deep      DeViSE* [15]   56.7   33.5   –      –
          CMT* [47]      60.8   39.6   –      –
          MTMDL* [55]    63.7   32.3   –      –
          DCN w/o ET     82.0   52.3   66.4   66.9
          DCN w/o E      82.3   55.6   67.4   68.4

Table 2: Results of Standard Zero-Shot Learning on SUN, CUB, and AwA using Standard Split (SS) with GoogLeNet features [8]. * indicates the deep methods that are fine-tuned from GoogLeNet-v2.

Type      Method         SUN    CUB    AwA    aPY    Avg
Shallow   DAP [32]       39.9   40.0   44.1   33.8   39.5
          ALE [1]        58.1   54.9   59.9   39.7   53.2
          SJE [3]        53.7   53.9   65.6   32.9   51.5
          ESZSL [42]     54.5   53.9   58.2   38.3   51.2
          SAE [28]       40.3   33.3   53.0    8.3   33.7
          ConSE [36]     38.8   34.3   45.6   26.9   36.4
          SynC [8]       56.3   55.6   54.0   23.9   47.5
Deep      DeViSE [15]    56.5   52.0   54.2   39.8   50.6
          CMT [47]       39.9   34.6   39.5   28.0   35.5
          DCN w/o ET     59.0   53.2   63.4   42.0   54.4
          DCN w/o E      61.8   56.2   65.2   43.6   56.7

Table 3: Results of Standard Zero-Shot Learning using Proposed Split (PS) with ResNet-101 features [54].

Proposed Splits (PS) The results on the Proposed Splits (PS) [32] of the four datasets are reported in Table 3, using ResNet-101 
features and normalized attributes as class embeddings. DCN outperforms all previous methods substantially and consistently. (i) Our model performs better on datasets with a relatively large number of classes, e.g. SUN and CUB. We conjecture that this is because the proposed cross-entropy loss is a natural fit for multi-class classification problems, which distinguishes our loss design from that of previous zero-shot learning methods. (ii) More importantly, DCN outperforms DCN w/o ET, which validates that temperature calibration leads to consistently improved zero-shot classification accuracy.

(a) τ = 0.3    (b) τ = 1

Figure 3: Probability histograms (prediction probability vs. % of samples) of DCN with (τ = 0.3) or without (τ = 1) temperature calibration.

Result Analysis  We show in Figure 3 the histograms of probabilities output by our deep models with (τ = 0.3) or without (τ = 1) temperature calibration on AwA test data from the 10 target classes. As shown, DCN (τ = 1) produces nearly absolute (0/1) probabilities for test data from the target classes, implying undesirable overconfidence for zero-shot learning.
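The temperature-calibrated softmax behind this comparison can be sketched as follows. This is a minimal illustration, not the paper's exact Eq. (6): the multiplicative placement of τ (scaling the logits before the softmax) is an assumption chosen to reproduce the behavior in Figure 3, where τ < 1 softens the output distribution:

```python
import numpy as np

def calibrated_softmax(logits, tau=0.3):
    """Temperature-calibrated softmax (sketch). With tau < 1 the logits are
    scaled down, so the resulting distribution is softer (less confident);
    tau = 1 recovers the ordinary softmax."""
    z = tau * np.asarray(logits, dtype=float)
    z -= z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([8.0, 2.0, 1.0])
p_sharp = calibrated_softmax(logits, tau=1.0)   # nearly one-hot: over-confident
p_soft = calibrated_softmax(logits, tau=0.3)    # noticeably softer distribution
```

Note that scaling the logits changes only the confidence of the prediction, not the argmax, which is why calibration does not hurt standard accuracy.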
DCN (\u03c4 = 0.3), the variant with temperature\ncalibration, produces discriminative probabilities that is not prone to over\ufb01tting and overcon\ufb01dences.\n\nMethods\nDAP [32]\nALE [1]\nSJE [3]\nESZSL [42]\nConSE [36]\nSynC [8]\nDeViSE [15]\nCMT [47]\nZSKL [57]\nDCN w/o ET\nDCN w/o E\nDCN\n\nts\n4.2\n21.8\n14.7\n11.0\n6.8\n7.9\n16.9\n8.1\n20.1\n23.8\n23.9\n25.5\n\nSUN\n\ntr\n25.1\n33.1\n30.5\n27.9\n39.9\n43.3\n27.4\n21.8\n31.4\n36.1\n37.2\n37.0\n\nH\n7.2\n26.3\n19.8\n15.8\n11.6\n13.4\n20.9\n11.8\n24.5\n28.7\n29.1\n30.2\n\nts\n1.7\n23.7\n23.5\n12.6\n1.6\n11.5\n23.8\n7.2\n21.6\n26.5\n25.9\n28.4\n\nCUB\n\ntr\n67.9\n62.8\n59.2\n63.8\n72.2\n70.9\n53.0\n49.8\n52.8\n52.5\n65.8\n60.7\n\nH\n3.3\n34.4\n33.6\n21.0\n3.1\n19.8\n32.8\n12.6\n30.6\n35.2\n37.2\n38.7\n\nts\n0.0\n16.8\n11.3\n6.6\n0.4\n8.9\n13.4\n0.9\n18.9\n17.2\n17.2\n25.5\n\nAwA\n\ntr\n88.7\n76.1\n74.6\n75.6\n88.6\n87.3\n68.7\n87.6\n82.7\n84.7\n84.7\n84.2\n\nH\n0.0\n27.5\n19.6\n12.1\n0.8\n16.2\n22.4\n1.8\n30.8\n28.6\n28.6\n39.1\n\nts\n4.8\n4.6\n3.7\n2.4\n0.0\n7.4\n4.9\n1.4\n10.5\n8.3\n8.4\n14.2\n\naPY\ntr\n78.3\n73.7\n55.7\n70.1\n91.2\n66.3\n76.9\n85.2\n76.2\n73.0\n72.2\n75.0\n\nH\n9.0\n8.7\n6.9\n4.6\n0.0\n13.3\n9.2\n2.8\n18.5\n15.0\n15.1\n23.9\n\nTable 4: Results of Generalized Zero-Shot Learning on four datasets under Proposed Splits (PS) [54].\n\n4.3 Results of Generalized Zero-Shot Learning\n\nIn real applications, whether an image is from a source or target class is unknown in advance. Hence,\ngeneralized zero-shot learning is a more practical and dif\ufb01cult task. Here, we use the same models\ntrained on zero-shot learning setting on the Proposed Splits (PS) [54]. We evaluate our models on the\ngeneralized zero-shot learning setting, by evaluating performance on both source and target classes.\nThe results of generalized zero-shot learning are shown in Table 4, much lower than standard zero-\nshot learning. 
This is not surprising since the source classes are included in the search space and act as distractors for the images that come from target classes. As a result, methods performing better on source classes (e.g. DAP and ConSE) often perform worse on target classes, and vice versa (e.g. ALE and DeViSE). This reveals the importance of making models aware of the target classes. DCN addresses this problem and yields much higher harmonic accuracy ACC_H and ACC_ts on all datasets. By imposing the mild assumption that the semantic representations of target classes (which are cheap to obtain) are available during training, DCN is made accessible to target classes, which bridges the source and target classes through both semantic representations and the seen data. Note that DCN performs better than DCN w/o E (the variant without entropy penalty), highlighting the efficacy of the uncertainty calibration.

5 Conclusion

This paper proposed a deep calibration network for generalized zero-shot learning. The approach enables simultaneous calibration of deep networks on the confidence of source classes and the uncertainty of target classes, and as a consequence, bridges the source and target classes through both semantic representations of classes and visual embeddings of seen images. Experiments show that our approach yields state-of-the-art performance for generalized zero-shot learning tasks on four benchmark datasets.

Acknowledgments

This work was supported by the National Key R&D Program of China (2016YFB1000701), the Natural Science Foundation of China (61772299, 71690231, 61502265) and the DARPA Program on Lifelong Learning Machines.

References

[1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for attribute-based classification. In CVPR, pages 819–826, 2013.
[2] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for image classification.
IEEE TPAMI, 38(7):1425–1438, 2016.
[3] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, pages 2927–2936, 2015.
[4] Z. Al-Halah and R. Stiefelhagen. How to transfer? Zero-shot object recognition via hierarchical transfer of semantic attributes. In WACV, pages 837–843, 2015.
[5] Y. Atzmon and G. Chechik. Probabilistic AND-OR attribute grouping for zero-shot learning. In UAI, 2018.
[6] L. J. Ba, K. Swersky, S. Fidler, and R. Salakhutdinov. Predicting deep zero-shot convolutional neural networks using textual descriptions. In ICCV, pages 4247–4255, 2015.
[7] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE TPAMI, 35(8):1798–1828, Aug 2013.
[8] S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha. Synthesized classifiers for zero-shot learning. In CVPR, pages 5327–5336, 2016.
[9] S. Changpinyo, W.-L. Chao, and F. Sha. Predicting visual exemplars of unseen classes for zero-shot learning. In ICCV, 2017.
[10] W.-L. Chao, S. Changpinyo, B. Gong, and F. Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV, pages 52–68, 2016.
[11] M. Elhoseiny, B. Saleh, and A. Elgammal. Write a classifier: Zero-shot learning using purely textual descriptions. In ICCV, pages 2584–2591, 2013.
[12] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, pages 1778–1785, 2009.
[13] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, pages 1778–1785, 2009.
[14] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE TPAMI, 28(4):594–611, 2006.
[15] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov.
DeViSE: A deep visual-semantic embedding model. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, NIPS, pages 2121–2129. 2013.
[16] Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong. Learning multimodal latent attributes. IEEE TPAMI, 36(2):303–316, 2014.
[17] Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong. Transductive multi-view zero-shot learning. IEEE TPAMI, 37(11):2332–2345, 2015.
[18] Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong. Transductive multi-view zero-shot learning. IEEE TPAMI, 37(11):2332–2345, 2015.
[19] Y. Fu and L. Sigal. Semi-supervised vocabulary-informed learning. In CVPR, pages 5337–5346, 2016.
[20] Z. Fu, T. Xiang, E. Kodirov, and S. Gong. Zero-shot object recognition by semantic manifold distance. In CVPR, pages 2635–2644, 2015.
[21] E. Gavves, T. Mensink, T. Tommasi, C. G. Snoek, and T. Tuytelaars. Active transfer learning with zero-shot priors: Reusing past datasets for future tasks. In ICCV, pages 2731–2739, 2015.
[22] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
[23] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In ICML, 2017.
[24] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[25] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[26] D. P. Kingma and M. Welling. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.
[27] E. Kodirov, T. Xiang, Z. Fu, and S. Gong. Unsupervised domain adaptation for zero-shot learning. In ICCV, pages 2452–2460, 2015.
[28] E. Kodirov, T. Xiang, and S. Gong. Semantic autoencoder for zero-shot learning. In CVPR, 2017.
[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton.
ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[30] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In ICCV, pages 365–372, 2009.
[31] V. Kumar Verma, G. Arora, A. Mishra, and P. Rai. Generalized zero-shot learning via synthesized examples. In CVPR, June 2018.
[32] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE TPAMI, 36(3):453–465, 2014.
[33] Z. Li, E. Gavves, T. Mensink, and C. G. Snoek. Attributes make sense on segmented objects. In ECCV, pages 350–365, 2014.
[34] T. Mensink, E. Gavves, and C. G. Snoek. Costa: Co-occurrence statistics for zero-shot classification. In CVPR, pages 2441–2448, 2014.
[35] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
[36] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650, 2013.
[37] M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell. Zero-shot learning with semantic output codes. In NIPS, pages 1410–1418, 2009.
[38] D. Parikh and K. Grauman. Relative attributes. In ICCV, pages 503–510, 2011.
[39] G. Patterson and J. Hays. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In CVPR, pages 2751–2758, 2012.
[40] M. Rohrbach, M. Stark, and B. Schiele. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In CVPR, pages 1641–1648, 2011.
[41] M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, and B. Schiele. What helps where, and why? Semantic relatedness for knowledge transfer. In CVPR, pages 910–917, 2010.
[42] B.
Romera-Paredes and P. H. Torr. An embarrassingly simple approach to zero-shot learning. In ICML, pages 2152–2161, 2015.
[43] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. MIT Press, 1986.
[44] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.
[45] W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult. Toward open set recognition. IEEE TPAMI, 35(7):1757–1772, 2013.
[46] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[47] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In NIPS, pages 935–943, 2013.
[48] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[49] Y. H. Tsai, L. Huang, and R. Salakhutdinov. Learning robust visual-semantic embeddings. In ICCV, pages 3591–3600, 2017.
[50] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
[51] X. Wang and Q. Ji. A unified probabilistic approach modeling relationships between attributes and objects. In ICCV, pages 2120–2127, 2013.
[52] Z. Wu, Y. Fu, Y.-G. Jiang, and L. Sigal. Harnessing object and scene semantics for large-scale video understanding. In CVPR, pages 3112–3121, 2016.
[53] Y. Xian, T. Lorenz, B. Schiele, and Z. Akata. Feature generating networks for zero-shot learning. In CVPR, June 2018.
[54] Y. Xian, B. Schiele, and Z. Akata. Zero-shot learning - the good, the bad and the ugly.
In CVPR, July 2017.
[55] Y. Yang and T. M. Hospedales. A unified perspective on multi-domain and multi-task learning. In ICLR, 2015.
[56] F. X. Yu, L. Cao, R. S. Feris, J. R. Smith, and S.-F. Chang. Designing category-level attributes for discriminative visual recognition. In CVPR, pages 771–778, 2013.
[57] H. Zhang and P. Koniusz. Zero-shot kernel learning. In CVPR, June 2018.
[58] L. Zhang, T. Xiang, and S. Gong. Learning a deep embedding model for zero-shot learning. In CVPR, 2017.
[59] Z. Zhang and V. Saligrama. Classifying unseen instances by learning class-independent similarity functions. arXiv preprint arXiv:1511.04512, 2015.
[60] Z. Zhang and V. Saligrama. Zero-shot learning via semantic similarity embedding. In ICCV, pages 4166–4174, 2015.
[61] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In NIPS, pages 487–495, 2014.