{"title": "Transfer Learning in a Transductive Setting", "book": "Advances in Neural Information Processing Systems", "page_first": 46, "page_last": 54, "abstract": "Category models for objects or activities typically rely on supervised learning requiring sufficiently large training sets. Transferring knowledge from known categories to novel classes with no or only a few labels however is far less researched even though it is a common scenario. In this work, we extend transfer learning with semi-supervised learning to exploit unlabeled instances of (novel) categories with no or only a few labeled instances. Our proposed approach Propagated Semantic Transfer combines three main ingredients. First, we transfer information from known to novel categories by incorporating external knowledge, such as linguistic or expert-specified information, e.g., by a mid-level layer of semantic attributes. Second, we exploit the manifold structure of novel classes. More specifically we adapt a graph-based learning algorithm - so far only used for semi-supervised learning - to zero-shot and few-shot learning. Third, we improve the local neighborhood in such graph structures by replacing the raw feature-based representation with a mid-level object- or attribute-based representation. We evaluate our approach on three challenging datasets in two different applications, namely on Animals with Attributes and ImageNet for image classification and on MPII Composites for activity recognition. Our approach consistently outperforms state-of-the-art transfer and semi-supervised approaches on all datasets.", "full_text": "Transfer Learning in a Transductive Setting\n\nMarcus Rohrbach\n\nSandra Ebert\n\nBernt Schiele\n\nMax Planck Institute for Informatics, Saarbr\u00a8ucken, Germany\n{rohrbach,ebert,schiele}@mpi-inf.mpg.de\n\nAbstract\n\nCategory models for objects or activities typically rely on supervised learning\nrequiring suf\ufb01ciently large training sets. 
Transferring knowledge from known cat-\negories to novel classes with no or only a few labels is far less researched even\nthough it is a common scenario. In this work, we extend transfer learning with\nsemi-supervised learning to exploit unlabeled instances of (novel) categories with\nno or only a few labeled instances. Our proposed approach Propagated Semantic\nTransfer combines three techniques. First, we transfer information from known to\nnovel categories by incorporating external knowledge, such as linguistic or expert-\nspeci\ufb01ed information, e.g., by a mid-level layer of semantic attributes. Second,\nwe exploit the manifold structure of novel classes. More speci\ufb01cally we adapt a\ngraph-based learning algorithm \u2013 so far only used for semi-supervised learning \u2013\nto zero-shot and few-shot learning. Third, we improve the local neighborhood\nin such graph structures by replacing the raw feature-based representation with a\nmid-level object- or attribute-based representation. We evaluate our approach on\nthree challenging datasets in two different applications, namely on Animals with\nAttributes and ImageNet for image classi\ufb01cation and on MPII Composites for ac-\ntivity recognition. Our approach consistently outperforms state-of-the-art transfer\nand semi-supervised approaches on all datasets.\n\n1\n\nIntroduction\n\nWhile supervised training is an integral part of building visual, textual, or multi-modal category\nmodels, more recently, knowledge transfer between categories has been recognized as an important\ningredient to scale to a large number of categories as well as to enable \ufb01ne-grained categorization.\nThis development re\ufb02ects the psychological point of view that humans are able to generalize to\nnovel1 categories with only a few training samples [17, 1]. 
This has recently gained increased\ninterest in the computer vision and machine learning literature, which look at zero-shot recognition\n(with no training instances for a class) [11, 19, 9, 22, 16], and one- or few-shot recognition [29, 1,\n21]. Knowledge transfer is particularly bene\ufb01cial when scaling to large numbers of classes [23, 16],\ndistinguishing \ufb01ne-grained categories [6], or analyzing compositional activities in videos [9, 22].\nRecognizing categories with no or only few labeled training instances is challenging. To improve ex-\nisting transfer learning approaches, we exploit several sources of information. Our approach allows\nusing (1) trained category models, (2) external knowledge, (3) instance similarity, and (4) labeled\ninstances of the novel classes if available. More speci\ufb01cally we learn category or attribute models\nbased on labeled training data for known categories y (see also Figure 1) using supervised training.\nThese trained models are then associated with the novel categories z using, e.g. expert or automati-\ncally mined semantic relatedness (cyan lines in Figure 1). Similar to unsupervised learning [32, 28]\nour approach exploits similarities in the data space via a graph structure to discover dense regions\nthat are associated with coherent categories or concepts (orange graph structure in Figure 1). How-\never, rather than using the raw input space, we map our data into a semantic output space with the\n\n1We use \u201cnovel\u201d throughout the paper to denote categories with no or few labeled training instances.\n\n1\n\n\fFigure 1: Conceptual visualisation of our approach Propagated Semantic Transfer. Known cate-\ngories y, novel categories z, instances x (colors denote predicted category af\ufb01liation). 
Qualitative\nresults can be found in supplemental material and on our website.\n\nmodels trained on the known classes (pink arrow) to bene\ufb01t from their discriminative knowledge.\nGiven the uncertain predictions and the graph structure we adapt semi-supervised label propaga-\ntion [34, 33] to generate more reliable predictions. If labeled instances are available they can be\nseamlessly added. Note, attribute or category models do not have to be retrained if novel classes are\nadded which is an important aspect e.g. in a robotic scenario.\nThe main contribution of this work is threefold. First, we propose a novel approach that extends\nsemantic knowledge transfer to the transductive setting, exploiting similarities in the unlabeled data\ndistribution. The approach allows to do zero-shot recognition but also smoothly integrate labels for\nnovel classes (Section 3). Second, we improve the local neighborhood structure in the raw feature\nspace by mapping the data into a low dimensional semantic output space using the trained attribute\nand category models. Third, we validate our approach on three challenging datasets for two differ-\nent applications, namely on Animals with Attributes and ImageNet for image classi\ufb01cation and on\nMPII Composites for activity recognition (Section 4). We also provide a discussion of related work\n(Section 2) and conclusions for future work (Section 5). The implementation for our Propagated\nSemantic Transfer and code to easily reproduce the results in this paper is available on our website.\n\n2 Related work\n\nKnowledge transfer or transfer learning has the goal to transfer information of learned models to\nchanging or unknown data distributions while reducing the need and effort to collect new training\nlabels. 
It refers to a variety of tasks, including domain adaptation [25] or sharing of knowledge and\nrepresentations [30, 3] (a recent categorization can be found in [20]).\nIn this work we focus on transferring knowledge from known categories with suf\ufb01cient training\ninstances to novel categories with limited training data. In computer vision or machine learning\nliterature this setting is normally referred to as zero-shot learning [11, 19, 24, 9, 16] if there are no\ninstances for the test classes available and one- or few-shot learning [16, 9, 8] if there are one or few\ninstances available for the novel classes.\nTo recognize novel categories zero-shot recognition uses additional information, typically in the\nform of an intermediate attribute representation [11, 9], direct similarity [24] between categories, or\nhierarchical structures of categories [35]. The information can either be manually speci\ufb01ed [11, 9]\nor mined automatically from knowledge bases [24, 22]. Our approach builds on these works by\nusing a semantic knowledge transfer approach as the \ufb01rst step. If one or a few training examples are\navailable, these are typically used to select or adapt known models [1, 9, 26]. In contrast to related\nwork, our approach uses the above mentioned semantic knowledge transfer also when few training\nexamples are available to reduce the dependency on the quality of the samples. Also, we still use\nthe labeled examples to propagate information.\nAdditionally, we exploit the neighborhood structure of the unlabeled instances to improve recogni-\ntion for zero- and few-shot recognition. 
This is in contrast to previous works with the exception of the zero-shot approach of [9], which learns a discriminative, latent attribute representation and applies self-training on the unseen categories. While conceptually similar, that approach differs from ours, as we explicitly use the local neighborhood structure of the unlabeled instances. A popular choice to integrate the local neighborhood structure of the data is graph-based methods. These have been used to discover a grouping by spectral clustering [18, 14] and to enable semi-supervised learning [34, 33]. Our setting is similar to the semi-supervised setting. To transfer labels from labeled to unlabeled data, label propagation is widely used [34, 33] and has been shown to work successfully in several applications [13, 7]. In this work, we extend transfer learning by considering the neighborhood structure of the novel classes. For this we adapt the well-known label propagation approach of [33]. We build a k-nearest-neighbor graph to capture the underlying manifold structure, as it has been shown to provide the most robust structure [15]. Nevertheless, the quality of the graph structure is key to the success of graph-based methods and strongly depends on the feature representation [5]. 
We thus improve\nthe graph structure by replacing the noisy raw input space with the more compact semantic output\nspace which has shown to improve recognition [26, 22].\nTo improve image classi\ufb01cation with reduced training data, [4, 27] use attributes as an intermediate\nlayer and incorporate unlabeled data, however, both works are in a classical semi-supervised learn-\ning setting similar to [5], while our setting is transfer learning. More speci\ufb01cally [27] propose to\nbootstrap classi\ufb01ers by adding unlabeled data. The bootstrapping is constrained by attributes shared\nacross classes. In contrast, we use attributes for transfer and exploit the similarity between instances\nof the novel classes. [4] automatically discover a discriminative attribute representation, while in-\ncorporating unlabeled data. This notion of attributes is different to ours as we want to use semantic\nattributes to enable transfer from other classes. Other directions to improve the quality of the inter-\nmediate representation include integrating metric learning [31, 16] or online methods [10] which we\ndefer to future work.\n\n3 Propagated Semantic Transfer (PST)\n\nOur main objective is to robustly recognize novel categories by transferring knowledge from known\nclasses and exploiting the similarity of the test instances. More speci\ufb01cally our novel approach called\nPropagated Semantic Transfer consists of the following four components: we employ semantic\nknowledge transfer from known classes to novel classes (Sec. 3.1); we combine the transferred\npredictions with labels for the novel classes (Sec. 3.2); a similarity metric is de\ufb01ned to achieve a\nrobust graph structure (Sec. 3.3); we propagate this information within the novel classes (Sec. 3.4).\n\n3.1 Semantic knowledge transfer\n\nWe \ufb01rst transfer knowledge using a semantic representation. This allows to include external know-\nledge sources. We model the relation between a set of K known classes y1, . 
. . , yK to the set of N novel classes z1, . . . , zN. Both sets are disjoint, i.e. {y1, . . . , yK} ∩ {z1, . . . , zN} = ∅. We use two strategies to achieve this transfer: i) an attribute representation that employs an intermediate representation of attributes a1, . . . , aM, or ii) direct similarities calculated among the known object classes. Both work without any training examples for zn, i.e. also for zero-shot recognition [11, 24].

i) Attribute representation. We use the Direct Attribute Prediction (DAP) model [11] in our formulation [24]. An intermediate level of M attribute classifiers p(am|x) is trained on the known classes yk to estimate the presence of attribute am in the instance x. The subsequent knowledge transfer requires an external knowledge source that provides class-attribute associations a_m^{zn} ∈ {0, 1}, indicating whether attribute am is associated with class zn. Options for such association information are discussed in Section 4.2. Given this information, the probability of the novel class zn being present in instance x can then be estimated [24]:

p(zn|x) ∝ ∏_{m=1}^{M} (2 p(am|x))^{a_m^{zn}}.    (1)

ii) Direct similarity. As an alternative to attributes, we can use the U most similar training classes y1, . . . , yU as a predictor for novel class zn given an instance x [24]:

p(zn|x) ∝ ∏_{u=1}^{U} (2 p(yu|x))^{y_u^{zn}},    (2)

where y_u^{zn} provides continuous normalized weights for the strength of the similarity between the novel class zn and the known class yu [24]. To comply with [23, 22] we slightly diverge from these models for the ImageNet and MPII Composites datasets by using a sum formulation instead of the probabilistic expression, i.e. for attributes

p(zn|x) ∝ ( ∑_{m=1}^{M} a_m^{zn} p(am|x) ) / ( ∑_{m=1}^{M} a_m^{zn} ),

and for direct similarity

p(zn|x) ∝ ( ∑_{u=1}^{U} p(yu|x) ) / U. 
Note that in this case we do not obtain probability estimates; for label propagation, however, the resulting scores are sufficient.

3.2 Combining transferred and ground truth labels

In the following we treat the multi-class problem as N binary problems, where N is the number of novel classes. For class zn the semantic knowledge transfer provides p(zn|x) ∈ [0, 1] for all instances x. We combine the best predictions per class, scaled to [−1, 1], with labels l̂(zn|x) ∈ {−1, 1} provided for some instances x in the following way:

l(zn|x) = { γ l̂(zn|x)               if there is a label for x,
            (1 − γ)(2 p(zn|x) − 1)   if p(zn|x) is among the top-δ fraction of predictions for zn,
            0                        otherwise.    (3)

γ provides a weighting between the true labels and the predicted labels. In the zero-shot case we only use predictions, i.e. γ = 0. The parameters δ, γ ∈ [0, 1] are chosen, similar to the remaining parameters, using cross-validation on the training set.

3.3 Similarity metric based on discriminative models for graph construction

We enhance transfer learning by also exploiting the neighborhood structure within novel classes, i.e. we assume a transductive setting. Graph-based semi-supervised learning incorporates this information by employing a graph structure over all instances. In this section we describe how to improve the graph structure, as it has a strong influence on the final results [5]. The k-NN graph is usually built on the raw feature descriptors of the data. Distances are computed for each pair (xi, xj) by

d(xi, xj) = ∑_{d=1}^{D} |x_{i,d} − x_{j,d}|,    (4)

where D is the dimensionality of the raw feature space. We note that the visual representation used for label propagation can be independent of the visual representation used for transfer. 
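To make the transfer and label-combination steps of Sections 3.1 and 3.2 concrete, the following is a minimal NumPy sketch of Equation 1 and Equation 3. The array layout and the NaN encoding for "no label" are our own illustrative assumptions, not part of the paper's implementation:

```python
import numpy as np

def transfer_scores(attr_probs, assoc):
    """Eq. 1 (DAP): p(z_n|x) proportional to prod_m (2 p(a_m|x))^(a_m^{z_n}).

    attr_probs: (num_instances, M) attribute probabilities p(a_m|x)
    assoc:      (N, M) binary class-attribute associations a_m^{z_n}
    Returns (num_instances, N) unnormalized class scores.
    """
    # Work in log space for numerical stability, then exponentiate.
    log_p = np.log(2.0 * np.clip(attr_probs, 1e-12, None))
    return np.exp(log_p @ assoc.T)

def combine_with_labels(p, labels, gamma=0.98, delta=0.15):
    """Eq. 3: merge transferred predictions, scaled to [-1, 1], with
    ground-truth labels in {-1, 1}.  `labels` is (num_instances, N)
    with NaN where no label is given (our encoding assumption)."""
    l = np.zeros_like(p)
    # keep only the top-delta fraction of predictions per class
    thresh = np.quantile(p, 1.0 - delta, axis=0)
    top = p >= thresh
    l[top] = (1.0 - gamma) * (2.0 * p[top] - 1.0)
    # ground-truth labels take precedence, weighted by gamma
    labeled = ~np.isnan(labels)
    l[labeled] = gamma * labels[labeled]
    return l
```

The result `l` is the per-class initialization that the propagation step consumes; with γ = 0 and no labels the sketch reduces to the zero-shot case.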
While the visual representation for transfer is required to provide good generalization abilities in conjunction with the employed supervised learning strategy, the visual representation for label propagation should induce a good neighborhood structure. Therefore we propose to use the more compact output space trained on the known classes, which we found to provide a much better structure, see Figure 5b. We thus compute the distances either on the M-dimensional vector of the attribute classifiers p(am|x) with M ≪ D, i.e.,

d(xi, xj) = ∑_{m=1}^{M} |p(am|xi) − p(am|xj)|,    (5)

or on the K-dimensional vector of object classifiers p(yκ|x) with K ≪ D, i.e.,

d(xi, xj) = ∑_{κ=1}^{K} |p(yκ|xi) − p(yκ|xj)|.    (6)

These distances are transformed into similarities with an RBF kernel: s(xi, xj) = exp(−d(xi, xj) / (2σ²)). Finally, we construct a k-NN graph, which is known for its good performance [15, 5], i.e.,

Wij = { s(xi, xj)   if s(xi, xj) is among the k largest similarities of xi,
        0           otherwise.    (7)

Figure 2: AwA (left), ImageNet (middle), and MPII Composite Activities (right)

3.4 Label propagation with certain and uncertain labels

In this work, we build upon the label propagation by [33]. The k-NN graph with RBF kernel gives the weighted graph W (see Section 3.3). Based on this graph we compute a normalized graph Laplacian, i.e., S = D^{−1/2} W D^{−1/2}, with the diagonal matrix D summing up the weights in each row of W. Traditional semi-supervised label propagation uses sparse ground truth labels. In contrast we have dense labels l(zn|x), which are a combination of uncertain predictions and certain labels (see Eq. 3), for all instances {x1, . . . , xi} of the novel classes zn. Therefore, we modify the initialization by setting

L(0)_n = [l(zn|x1), . . . , l(zn|xi)]    (8)

for the N novel classes. 
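The graph construction of Section 3.3 together with the initialization and propagation of Section 3.4 can be summarized in a short sketch. We use dense NumPy matrices and a direct solve for readability; a practical implementation would use sparse matrices or the iterative update instead, and the symmetrization of W is a common convention we assume rather than a detail stated in the paper:

```python
import numpy as np

def knn_graph(feats, k=50, sigma=1.0):
    """Eqs. 5-7: L1 distances on classifier-score vectors, RBF
    similarities, then keep the k largest similarities per instance."""
    d = np.abs(feats[:, None, :] - feats[None, :, :]).sum(-1)  # pairwise L1
    s = np.exp(-d / (2.0 * sigma ** 2))                        # RBF kernel
    np.fill_diagonal(s, 0.0)                                   # no self-edges
    W = np.zeros_like(s)
    nn = np.argsort(-s, axis=1)[:, :k]             # k nearest neighbors of x_i
    rows = np.repeat(np.arange(len(s)), k)
    W[rows, nn.ravel()] = s[rows, nn.ravel()]
    return np.maximum(W, W.T)                      # symmetrize (our assumption)

def propagate(W, L0, alpha=0.8):
    """Eq. 8 initialization L0 plus the closed-form solution
    L* = (I - alpha * S)^(-1) L0 with S = D^(-1/2) W D^(-1/2)."""
    d = W.sum(1)
    Dinv = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = Dinv @ W @ Dinv                            # normalized graph Laplacian
    n = len(W)
    return np.linalg.solve(np.eye(n) - alpha * S, L0)
```

Predicted classes then follow as the argmax over the columns of the returned matrix.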
For each class, labels are propagated through this graph structure, converging to the following closed-form solution:

L*_n = (I − αS)^{−1} L(0)_n    for 1 ≤ n ≤ N,    (9)

with the regularization parameter α ∈ (0, 1]. The resulting framework makes use of the manifold structure underlying the novel classes to regularize the predictions from transfer learning. In general, the algorithm converges after a few iterations.

4 Evaluation

4.1 Datasets

We shortly outline the most important properties of the examined datasets in the following paragraphs and show example images/frames in Figure 2.
AwA The Animals with Attributes dataset (AwA) [11] is one of the first and most widely used datasets for semantic knowledge transfer and zero-shot recognition. It consists of 50 mammal classes, 40 training (24,395 images) and 10 disjoint test classes (6,180 images). We use the 6 provided pre-computed image descriptors, which are concatenated.
ImageNet The ImageNet 2010 challenge [2] requires large-scale and fine-grained recognition. It consists of 1000 image categories, which are split into 800 training and 200 test categories according to [23]. We use the LLC and Fisher-Vector encoded SIFT descriptors provided by [23].
MPII Composite Activities The MPII Composite Cooking Activities dataset [22] distinguishes 41 basic cooking activities, such as prepare scrambled egg or prepare carrots, with video recordings of varying length from 1 to 41 minutes. It consists of a total of 256 videos; 44 are used for training the attribute representation, 170 are used as test data. We use the provided dense-trajectory representation and train/test split.

4.2 External knowledge sources and similarity measures

Our approach incorporates external knowledge to enable semantic knowledge transfer from known classes y to unseen classes z. 
We use the class-attribute associations a_m^{zn} for attribute-based transfer (Equation 1) or the inter-class similarities y_u^{zn} for direct-similarity-based transfer (Equation 2) provided with the datasets. In the following we shortly outline the knowledge sources and measures.
Manual (AwA) AwA is accompanied by a set of 85 attributes and associations to all 40 training and all 10 test classes. The associations are provided by human judgments [11].
Hierarchy (ImageNet) For ImageNet the manually constructed WordNet/ImageNet hierarchy is used to find the most similar of the 800 known classes (leaf nodes in the hierarchy). Furthermore, the 370 inner nodes can group several classes into attributes [23].

Approach                         AUC    Acc.
DAP [11]                         81.4   41.4
IAP [11]                         80.0   42.2
Zero-Shot Learning [9]           n/a    41.3
PST (ours) on image descriptors  81.2   40.5
PST (ours) on attributes         83.7   42.7

(a) Zero-Shot. Predictions with attributes and manually defined associations, in %.

(b) Few-Shot

Figure 3: Results on AwA Dataset, see Sec. 4.3.1.

Linguistic knowledge bases (AwA, ImageNet) An alternative to manual associations is automatically mined associations. We use the provided similarity matrices, which are extracted using different linguistic similarity measures. They are either based on linguistic corpora, namely Wikipedia and WordNet, or on hit-count statistics of web search. One can distinguish basic web search (Yahoo Web), web search refined to part associations (Yahoo Holonyms), image search (Yahoo Image and Flickr Image), or use of the information in the summary snippets returned by web search (Yahoo Snippets). As ImageNet does not provide attributes, we mined 811 part-attributes from the associated WordNet hierarchy [23].
Script data (MPII Composites) To associate composite cooking activities such as preparing carrots with attributes of fine-grained activities (e.g. wash, peel), ingredients (e.g. 
carrots), and tools\n(e.g. knife, peeler), textual description (Script data) of these activities were collected with AMT. The\nprovided associations are computed based on either the frequency statistics or, more discriminate,\nby term frequency times inverse document frequency (tf*idf ). Words in the text can be matched to\nlabels either literally or by using WordNet expansion [22].\n\n4.3 Results\n\nTo enable a direct comparison, we closely follow the experimental setups of the respective datasets\n[11, 23, 22]. On all datasets we train attribute or object classi\ufb01ers (for direct similarity) with one-vs-\nall SVMs using Mean Stochastic Gradient Descent [23] and, for AwA and MPII Composites, with a\n\u03c72 kernel approximation as in [22]. To get more distinctive representations for label propagation we\ntrain sigmoid functions [12] to estimate probabilities (on the training set for AwA/MPII Composites\nand on the validation set for ImageNet).\nThe hyper-parameters of our new Propagated Semantic Transfer algorithm are estimated using 5-\nfold cross-validation on the respective training set, splitting them into 80% known and 20% novel\nclasses: We determine the parameters for our approach on the AwA training set and then set them\nfor all datasets to \u03b1 = 0.8, \u03b3 = 0.98, the number of neighbors k = 50, the number of iterations for\npropagation to 10, and use L1 distance. Due to the different recognition precision of the datasets\nwe determine \u03b4 = 0.15/0.04 separately for AwA/ImageNet. For MPII Composites we only do\nzero-shot recognition and use all samples due to the limited number of samples of \u2264 7 per class.\nFor few-shot recognition we report the mean over 10 runs where we pick examples randomly. 
The labeled examples are included in the evaluation to make it comparable to the zero-shot case.
We validate our claim that the classifier output space induces a better neighborhood structure than the raw features by examining the k-nearest-neighbor (kNN) quality for both. In Figure 5b we compare the kNN quality on two datasets (see Sec. 4.1) for both feature representations. We observe that the attribute (Eq. 5) and object (Eq. 6) classifier-based representations (green and magenta dashed lines) achieve a significantly higher accuracy than the respective raw feature-based representations (Eq. 4, Fig. 5b solid lines). We note that a good kNN quality is required but not sufficient for good propagation, as the result also depends on the distribution and quality of the initial predictions. In the following, we compare the performance of the raw features with the attribute classifier representation.

(a) Zero-Shot.

(b) Few-Shot.

Figure 4: Results on ImageNet, see Sec. 4.3.2.

4.3.1 AwA - image classification

We start by comparing the performance of related work to our approach on AwA (see Sec. 4.1) in Figure 3, first examining the zero-shot results in Figure 3a, where no training examples are available for the novel, in this case unseen, classes. The best results on this dataset known to us are reported by [11]. On this 10-class zero-shot task they achieve 81.4% area under the ROC curve (AUC) and 41.4% multi-class accuracy (Acc) with DAP, averaged over the 10 test classes. Additionally we report results from Zero-Shot Learning [9], which achieves 41.3% Acc. 
Our\nPropagated Semantic Transfer, using the raw image descriptors to build a neighborhood structure,\nachieves 81.2% AUC and 40.5% Acc. However, when propagating on the 85-dimensional attribute\nspace, we improve over [11] and [9] to 83.7% AUC and 42.7% Acc. To understand the difference\nin performance between the attribute and the image descriptor space we examine the neighborhood\nquality used for propagating labels shown in Figure 5b. The k-NN accuracy, measured on the ground\ntruth labels, is signi\ufb01cantly higher for the attribute space (green dashed curve) compared to the raw\nfeatures (solid green). The information is more likely propagated to neighbors of the correct class\nfor the attribute-space leading to a better \ufb01nal prediction. Another advantage is the signi\ufb01cantly\nreduced computation and storage costs for building the k-NN graph which scales linearly with the\ndimensionality. We believe that such an intermediate space, in this case represented by attributes,\nmight provide a better neighborhood structure and could be used in other label-propagation tasks.\nNext we compare our approach in the few-shot setting, i.e. we add labeled examples per class. In\nFigure 3b we compare our approach (PST) to two label propagation (LP) baselines. We \ufb01rst note\nthat PST (red curves) seamlessly moves from zero-shot to few-shot, while traditional LP (blue and\nblack curves) needs at least one training example. We \ufb01rst examine the three solid lines. The black\ncurve is our best LP variant from [5] evaluated on the 10 test classes of AwA rather than all 50\nas in [5]. We also compute LP in combination with the similarity metric based on the attribute\nclassi\ufb01er scores (blue curves). This transfer of knowledge residing in the classi\ufb01er trained on the\nknown classes already gives a signi\ufb01cant improvement in performance. Our approach (red curve)\nadditionally transfers labels from the known classes and improves further. 
Especially for few labels our approach benefits from the transfer; e.g. for 5 labeled samples per class PST achieves 43.9% accuracy, compared to 38.1% for LP with attribute classifiers and 32.2% for [5]. For fewer samples LP drops significantly, while our approach has nearly stable performance. For large amounts of training data, PST approaches - as expected - LP (red vs. blue in Figure 3b).
The dashed lines in Figure 3b provide results for automatically mined associations a_m^{zn} between attributes and classes. It is interesting to note that these automatically mined associations achieve performance very close to the manually defined associations (dashed vs. solid). In this plot we use Yahoo Image as the basis for the semantic relatedness, but we also provide the improvements of PST for the other linguistic knowledge sources in the supplemental material.

4.3.2 ImageNet - large scale image classification

In this section we evaluate our Propagated Semantic Transfer approach on a large image classification task with 200 unseen image categories, using the setup proposed by [23]. We report the top-5 accuracy [2], which requires one of the best five predictions for an image to be correct. (Top-5 accuracy = 1 - top-5 error as defined in [2].)

(a) MPII Composite Activities, see Sec. 
4.3.3.\n\n(b) Accuracy of the majority vote from\nkNN (kNN-Classi\ufb01er) on test sets\u2019 ground truth.\n\nFigure 5: Results\n\nResults are reported in Figure 4. For zero-shot recognition our PST (red bars) improves performance\nover [23] (black bars) as shown in Figure 4a. The largest improvement in top-5 accuracy is achieved\nfor Yahoo Image with Attributes which increases by 6.7% to 25.3%. The absolute performance of\n34.0% top-5 accuracy is achieved by using the inner nodes of the WordNet hierarchy for transfer,\nclosely followed by Yahoo Web with direct similarity, achieving 33.1% top-5 accuracy. Similar to\nthe AwA dataset we improve PST over the LP-baseline for few-shot recognition (Figure 4b).\n\n4.3.3 MPII composite - activity recognition\n\nIn the last two subsections, we showed the bene\ufb01t of Propagated Semantic Transfer on two image\nclassi\ufb01cation challenges. We now evaluate our approach on the video-activity recognition dataset\nMPII Composite Cooking Activities [22]. We compute mean AP using the provided features and\nfollow the setup of [22].\nIn Figure 5a we compare our performance (red bars) to the results of\nzero-shot recognition without propagation [22] (black bars) for four variants of Script data based\ntransfer. 
Our approach achieves significant performance improvements in all four cases, increasing mean AP by 11.1%, 10.7%, 12.0%, and 7.7% to 34.0%, 32.8%, 34.4%, and 29.2%, respectively. This is especially impressive as it reaches the level of supervised training: for the same set of attributes (and very few, ≤ 7, training examples per class) [22] achieve 32.2% for SVM, 34.6% for NN-classification, and up to 36.2% for a combination of NN with script data.
We find these results encouraging, as it is much more difficult to collect and label training examples for this domain than for image classification, and the complexity and compositional nature of activities frequently require recognizing unseen categories [9].

5 Conclusion

In this work we address a frequently occurring setting where there is a large amount of training data for some classes, but others, e.g. novel classes, have no or only few labeled training samples. We propose a novel approach named Propagated Semantic Transfer, which integrates semantic knowledge transfer with the visual similarities of unlabeled instances within the novel classes. We adapt a semi-supervised label-propagation approach by building the neighborhood graph on an expressive, low-dimensional semantic output space and by initializing it with predictions from knowledge transfer.
We evaluated this approach on three diverse datasets for image and video-activity recognition, consistently improving performance over the state-of-the-art for zero-shot and few-shot prediction. Most notably we achieve 83.7% AUC / 42.7% multi-class accuracy on the Animals with Attributes dataset for zero-shot recognition, scale to 200 unseen classes on ImageNet, and achieve up to 34.4% (+12.0%) mean AP on MPII Composite Activities, which is on the level of supervised training on this dataset. 
We show that our approach consistently improves performance independent of factors such as (1) the specific datasets and descriptors, (2) the transfer approach: direct vs. attributes, (3) the type of transfer association: manually defined, linguistic knowledge bases, or script data, (4) the domain: image and video-activity recognition, and (5) the model: probabilistic vs. sum formulation.

Acknowledgements. This work was partially funded by the DFG project SCHI989/2-2.

References
[1] E. Bart & S. Ullman. Single-example learning of novel classes using representation by similarity. In BMVC, 2005.
[2] A. Berg, J. Deng, & L. Fei-Fei. ILSVRC 2010. www.image-net.org/challenges/LSVRC/2010/, 2010.
[3] U. Blanke & B. Schiele. Remember and transfer what you have learned - recognizing composite activities based on activity spotting. In ISWC, 2010.
[4] J. Choi, M. Rastegari, A. Farhadi, & L. S. Davis. Adding unlabeled samples to categories by learned attributes. In CVPR, 2013.
[5] S. Ebert, D. Larlus, & B. Schiele. Extracting structures in image collections for object recognition. In ECCV, 2010.
[6] R. Farrell, O. Oza, V. Morariu, T. Darrell, & L. S. Davis. Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance. In ICCV, 2011.
[7] R. Fergus, Y. Weiss, & A. Torralba. Semi-supervised learning in gigantic image collections. In NIPS, 2009.
[8] M. Fink. Object classification from a single example utilizing class relevance pseudo-metrics. In NIPS, 2004.
[9] Y. Fu, T. M. Hospedales, T. Xiang, & S. Gong. Learning multi-modal latent attributes. TPAMI, PP(99), 2013.
[10] P. Kankuekul, A. Kawewong, S. Tangruamsub, & O. Hasegawa. Online incremental attribute-based zero-shot learning. In CVPR, 2012.
[11] C. Lampert, H. Nickisch, & S. Harmeling. Attribute-based classification for zero-shot learning of object categories. TPAMI, PP(99), 2013.
[12] H.-T. Lin, C.-J. Lin, & R. C. Weng. A note on Platt's probabilistic outputs for support vector machines. Machine Learning, 2007.
[13] J. Liu, B. Kuipers, & S. Savarese. Recognizing human actions by attributes. In CVPR, 2011.
[14] U. Luxburg. A tutorial on spectral clustering. Stat Comput, 17(4):395-416, 2007.
[15] M. Maier, U. V. Luxburg, & M. Hein. Influence of graph construction on graph-based clustering measures. In NIPS, 2008.
[16] T. Mensink, J. Verbeek, F. Perronnin, & G. Csurka. Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV, 2012.
[17] Y. Moses, S. Ullman, & S. Edelman. Generalization to novel images in upright and inverted faces. Perception, 25:443-461, 1996.
[18] A. Y. Ng, M. I. Jordan, & Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, 2002.
[19] M. Palatucci, D. Pomerleau, G. Hinton, & T. Mitchell. Zero-shot learning with semantic output codes. In NIPS, 2009.
[20] S. J. Pan & Q. Yang. A survey on transfer learning. TKDE, 22:1345-1359, 2010.
[21] R. Raina, A. Battle, H. Lee, B. Packer, & A. Ng. Self-taught learning: Transfer learning from unlabeled data. In ICML, 2007.
[22] M. Rohrbach, M. Regneri, M. Andriluka, S. Amin, M. Pinkal, & B. Schiele. Script data for attribute-based recognition of composite activities. In ECCV, 2012.
[23] M. Rohrbach, M. Stark, & B. Schiele. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In CVPR, 2011.
[24] M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, & B. Schiele. What helps where - and why? Semantic relatedness for knowledge transfer. In CVPR, 2010.
[25] K. Saenko, B. Kulis, M. Fritz, & T. Darrell. Adapting visual category models to new domains. In ECCV, 2010.
[26] V. Sharmanska, N. Quadrianto, & C. H. Lampert. Augmented attribute representations. In ECCV, 2012.
[27] A. Shrivastava, S. Singh, & A. Gupta. Constrained semi-supervised learning using attributes and comparative attributes. In ECCV, 2012.
[28] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, & W. T. Freeman. Discovering object categories in image collections. In ICCV, 2005.
[29] S. Thrun. Is learning the n-th thing any easier than learning the first? In NIPS, 1996.
[30] A. Torralba, K. Murphy, & W. Freeman. Sharing visual features for multiclass and multiview object detection. In CVPR, 2004.
[31] D. Tran & A. Sorokin. Human activity recognition with metric learning. In ECCV, 2008.
[32] M. Weber, M. Welling, & P. Perona. Towards automatic discovery of object categories. In CVPR, 2000.
[33] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, & B. Schölkopf. Learning with local and global consistency. In NIPS, 2004.
[34] X. Zhu, Z. Ghahramani, & J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 2003.
[35] A. Zweig & D. Weinshall. Exploiting object hierarchy: Combining models from different category levels. In ICCV, 2007.