{"title": "Semantic Kernel Forests from Multiple Taxonomies", "book": "Advances in Neural Information Processing Systems", "page_first": 1718, "page_last": 1726, "abstract": "When learning features for complex visual recognition problems, labeled image exemplars alone can be insufficient.  While an \\emph{object taxonomy} specifying the categories' semantic relationships could bolster the learning process, not all relationships are relevant to a given visual classification task, nor does a single taxonomy capture all ties that \\emph{are} relevant.  In light of these issues, we propose a discriminative feature learning approach that leverages \\emph{multiple} hierarchical taxonomies representing different semantic views of the object categories (e.g., for animal classes, one taxonomy could reflect their phylogenic ties, while another could reflect their habitats).  For each taxonomy, we first learn a tree of semantic kernels, where each node has a Mahalanobis kernel optimized to distinguish between the classes in its children nodes.  Then, using the resulting \\emph{semantic kernel forest}, we learn class-specific kernel combinations to select only those relationships relevant to recognize each object class.  To learn the weights, we introduce a novel hierarchical regularization term that further exploits the taxonomies' structure.  We demonstrate our method on challenging object recognition datasets, and show that interleaving multiple taxonomic views yields significant accuracy improvements.", "full_text": "Semantic Kernel Forests from Multiple Taxonomies\n\nSung Ju Hwang\nUniversity of Texas\nAustin, TX 78701\n\nKristen Grauman\nUniversity of Texas\nAustin, TX 78701\n\nFei Sha\n\nUniversity of Southern California\n\nLos Angeles, CA 90089\n\nsjhwang@cs.utexas.edu\n\ngrauman@cs.utexas.edu\n\nfeisha@usc.edu\n\nAbstract\n\nWhen learning features for complex visual recognition problems, labeled image\nexemplars alone can be insuf\ufb01cient. 
While an object taxonomy specifying the cat-\negories\u2019 semantic relationships could bolster the learning process, not all relation-\nships are relevant to a given visual classi\ufb01cation task, nor does a single taxonomy\ncapture all ties that are relevant. In light of these issues, we propose a discrim-\ninative feature learning approach that leverages multiple hierarchical taxonomies\nrepresenting different semantic views of the object categories (e.g., for animal\nclasses, one taxonomy could re\ufb02ect their phylogenic ties, while another could re-\n\ufb02ect their habitats). For each taxonomy, we \ufb01rst learn a tree of semantic kernels,\nwhere each node has a Mahalanobis kernel optimized to distinguish between the\nclasses in its children nodes. Then, using the resulting semantic kernel forest, we\nlearn class-speci\ufb01c kernel combinations to select only those relationships relevant\nto recognize each object class. To learn the weights, we introduce a novel hier-\narchical regularization term that further exploits the taxonomies\u2019 structure. We\ndemonstrate our method on challenging object recognition datasets, and show that\ninterleaving multiple taxonomic views yields signi\ufb01cant accuracy improvements.\n\n1 Introduction\n\nObject recognition research has made impressive gains in recent years, with particular success\nin using discriminative learning algorithms to train classi\ufb01ers tuned to each category of interest\n(e.g., [1, 2]). As the basic \u201cimage features + labels + classi\ufb01er\u201d paradigm has reached a level of\nmaturity, we believe it is time to reach beyond it towards models that incorporate richer semantic\nknowledge about the object categories themselves.\n\nOne appealing source of such external knowledge is a taxonomy. A hierarchical semantic taxonomy\nis a tree that groups classes together in its nodes according to some human-designed merging or\nsplitting criterion. 
For example, well-known taxonomies include WordNet, which groups words\ninto sets of cognitive synonyms and their super-subordinate relations [3], and the phylogenetic tree\nof life, which groups biological species based on their physical or genetic properties. Critically,\nsuch trees implicitly embed cues about human perception of categories, how they relate to one\nanother, and how those relationships vary at different granularities. Thus, in the context of visual\nobject recognition, such a structure has the potential to guide the selection of meaningful low-level\nfeatures, essentially augmenting the standard supervision provided by image labels. Some initial\nsteps have been made based on this intuition, typically by leveraging the WordNet hierarchy as a\nprior on inter-class visual similarity [4, 5, 6, 7, 8, 9, 10, 11].\n\nTwo fundamental issues, however, complicate the use of a semantic taxonomy for learning visual\nobjects. First, a given taxonomy may offer hints about visual relatedness, but its structure need not\nentirely align with useful splits for recognition. (For example, monkey and dog are fairly distant\nsemantically according to WordNet, yet they share a number of visual features. An apple and apple-\nsauce are semantically close, yet are easily separable with basic visual features.) Second, given the\ncomplexity of visual objects, it is highly unlikely that some single optimal semantic taxonomy exists\nto lend insight for recognition. 
While previous work relies on a single taxonomy out of convenience, in reality objects can be organized along many semantic dimensions or \u201cviews\u201d. (For example, a Dalmatian belongs to the same group as the wolf according to a biological taxonomy, as both are canines. However, in terms of visual attributes, it can be grouped with the leopard, as both are spotted; in terms of habitat, it can be grouped with the Siamese cat, as both are domestic. See Figure 1.)\n\nFigure 1: Main idea: For a given set of classes, we assume multiple semantic taxonomies exist, each one representing a different \u201cview\u201d of the inter-class semantic relationships (here, Biological, Appearance, and Habitat trees over Dalmatian, wolf, Siamese cat, and leopard). Rather than commit to a single taxonomy\u2014which may or may not align well with discriminative visual features\u2014we learn a tree of kernels for each taxonomy that captures the granularity-speci\ufb01c similarity at each node. Then we show how to exploit the inter-taxonomic structure when learning a combination of these kernels from multiple taxonomies (i.e., a \u201ckernel forest\u201d) to best serve the object recognition tasks.\n\nMotivated by these issues, we present a discriminative feature learning approach that leverages multiple taxonomies capturing different semantic views of the object categories. Our key insight is that some combination of the semantic views will be most informative to distinguish a given visual category. 
Continuing with the sketch in Figure 1, that might mean that the \ufb01rst taxonomy helps\nlearn dog- and cat-like features, while the second taxonomy helps elucidate spots and pointy corner\nfeatures, while the last reveals context cues such as proximity to humans or indoor scene features.\nWhile each view differs in its implicit human-designed splitting criterion, all separate some classes\nfrom others, thereby lending (often complementary) discriminative cues. Thus, rather than commit\nto a single representation, we aim to inject pieces of the various taxonomies as needed.\n\nTo this end, we propose semantic kernel forests. Our method takes as input training images labeled\naccording to their object category, as well as a series of taxonomies, each of which hierarchically\npartitions those same labels (object classes) by a different semantic view. For each taxonomy, we\n\ufb01rst learn a tree of semantic kernels: each node in a tree has a Mahalanobis-based kernel optimized to\ndistinguish between the classes in its children nodes. The kernels in one tree isolate image features\nuseful at a range of category granularities. Then, using the resulting semantic kernel forest from\nall taxonomies, we apply a form of multiple kernel learning (MKL) to obtain class-speci\ufb01c kernel\ncombinations, in order to select only those relationships relevant to recognize each object class. We\nintroduce a novel hierarchical regularization term into the MKL objective that further exploits the\ntaxonomies\u2019 structure. The output of the method is one learned kernel per object class, which we\ncan then deploy for one-versus-all multi-class classi\ufb01cation on novel images.\n\nOur main contribution is to simultaneously exploit multiple semantic taxonomies for visual fea-\nture learning. 
Whereas past work focuses on building object hierarchies for scalable classi\ufb01cation [12, 13] or using WordNet to gauge semantic distance [5, 6, 8, 9], we learn discriminative kernels that capitalize on the cues in diverse taxonomy views, leading to better recognition accuracy. The primary technical contributions are i) an approach to generate semantic base kernels across taxonomies, ii) a method to integrate the complementary cues from multiple suboptimal taxonomies, and iii) a novel regularizer for multiple kernel learning that exploits hierarchical structure from the taxonomy, allowing kernel selection to bene\ufb01t from semantic knowledge of the problem domain.\n\nWe demonstrate our approach with challenging images from the Animals with Attributes and ImageNet datasets [14, 7] together with taxonomies spanning cognitive synsets, visual attributes, behavior, and habitats. Our results show that the taxonomies can indeed boost feature learning, letting us bene\ufb01t from humans\u2019 perceived distinctions as implicitly embedded in the trees. Furthermore, we show that interleaving the forest of multiple taxonomic views leads to the best performance, particularly when coupled with the proposed novel regularization.\n\n2 Related Work\n\nLeveraging hierarchies for object recognition Most work in object recognition that leverages category hierarchy does so for the sake of ef\ufb01cient classi\ufb01cation [15, 16, 12, 13, 17]. Making coarse to \ufb01ne predictions along a tree of classi\ufb01ers ef\ufb01ciently rules out unlikely classes at an early stage. Since taxonomies need not be ideal structures for this goal, recent work focuses on novel ways to optimize the tree structure itself [12, 13, 17], while others consider splits based on initial inter-class confusions [16]. 
A parallel line of work explores unsupervised discovery of hierarchies for image\norganization and browsing, from images alone [18, 19] or from images and tags [20]. Whereas all\nsuch work exploits tree structures to improve ef\ufb01ciency (whether in classi\ufb01cation or browsing), our\ngoal is for externally de\ufb01ned semantic hierarchies to enhance recognition accuracy.\n\nMore related to our problem setting are techniques that exploit the inter-class relationships in a\ntaxonomy [5, 6, 8, 9, 10, 11]. One idea is to combine the decisions of classi\ufb01ers along the semantic\nhierarchy [5, 4]. Alternatively, the semantic \u201cdistance\u201d between nodes can be used to penalize\nmisclassi\ufb01cations more meaningfully [9], or to share labeled exemplars between similar classes [8].\nMetric learning and feature selection can also bene\ufb01t from an object hierarchy, either by preferring\nto use disjoint feature sets to discriminate super- and sub-classes [10], by using a taxonomy-induced\nloss for structured sparsity [21], or by sharing parameters between metrics along the same path [11].\nAll prior work commits to a single taxonomy, however, which as discussed above may restrict the\nsemantics\u2019 impact and will not always align well with the visual data.\n\nClassi\ufb01cation with multiple semantic views Combining information from multiple \u201cviews\u201d of\ndata is a well-researched topic in the machine learning, multimedia, and computer vision commu-\nnities. In multi-view learning, the training data typically consists of paired examples coming from\ndifferent modalities\u2014e.g., text and images, or speech and video; basic approaches include recov-\nering the underlying shared latent space for both views [22, 20], bootstrapping classi\ufb01ers formed\nindependently per feature space [23, 24], or accounting for the view dependencies during cluster-\ning [25, 26]. 
When the classi\ufb01cation tasks themselves are grouped, multi-task learning methods\nleverage the parallel tasks to regularize parameters learned for the individual classi\ufb01ers or features\n(e.g., [27, 28, 29]). Broadly speaking, our problem has a similar spirit to such settings, since we want\nto leverage multiple parallel taxonomies over the data; however, our goal to aggregate portions of\nthe taxonomies during feature learning is quite distinct. More speci\ufb01cally, while previous methods\nattempt to \ufb01nd a single structure to accommodate both views, we seek complementary information\nfrom the semantic views and assemble task-speci\ufb01c discriminative features.\n\nLearning kernel combinations Multiple kernel learning (MKL) algorithms [30] have shown\npromise for image recognition (e.g., [31, 32]) and are frequently employed in practice as a prin-\ncipled way to combine feature types. Our approach also employs a form of MKL, but rather than\npool kernels stemming from different low-level features or kernel hyperparameters, it pools kernels\nstemming from different semantic sources. Furthermore, our addition of a novel regularizer exploits\nthe hierarchical structure from which the kernels originate.\n\n3 Approach\n\nWe cast the problem of learning semantic features from multiple taxonomies as learning to combine\nkernels. The base kernels capture features speci\ufb01c to individual taxonomies and granularities within\nthose taxonomies, and they are combined discriminatively to improve classi\ufb01cation, weighing each\ntaxonomy and granularity only to the extent useful for the target classi\ufb01cation task.\n\nWe describe the two main components of the approach in turn: learning the base kernels\u2014which we\ncall a semantic kernel forest (Sec. 3.1), and learning their combination across taxonomies (Sec. 
3.2), where we devise a new hierarchical regularizer for MKL.\n\nIn what follows, we assume that we are given a labeled dataset D = {(xi, yi)}, i = 1, . . . , N, where (xi, yi) stands for the ith instance (feature vector) and its class label is yi, as well as a set of tree-structured taxonomies {Tt}, t = 1, . . . , T. Each taxonomy Tt is a collection of nodes. The leaf nodes correspond to class labels, and the inner nodes correspond to superclasses\u2014or, more generally, semantically meaningful groupings of categories. We index those nodes with double subscripts tn, where t refers to the tth taxonomy and n to the nth node in that taxonomy. Without loss of generality, we assign the leaf nodes (i.e., the class nodes) a number between 1 and C, where C is the number of class labels.\n\n3.1 Learning a semantic kernel forest\n\nOur \ufb01rst step is to learn a forest of base kernels. These kernels are granularity- and view-speci\ufb01c; that is, they are tuned to similarities implied by the given taxonomies. While base kernels are learned independently per taxonomy, they are learned jointly within each taxonomy, as we describe next.\n\nFormally, for each taxonomy Tt, we learn a set of Gaussian kernels for the superclass at every internal node tn for which n \u2265 C + 1. The Gaussian kernels are parameterized as\n\nKtn(xi, xj) = exp{\u2212\u03b3tn d2Mtn(xi, xj)} = exp{\u2212\u03b3tn (xi \u2212 xj)T Mtn (xi \u2212 xj)},  (1)\n\nwhere the Mahalanobis distance metric Mtn is used in lieu of the conventional Euclidean metric. Note that for leaf nodes where n \u2264 C, we do not learn base kernels.\n\nWe want the base kernels to encode similarity between examples using features that re\ufb02ect their respective granularity in the taxonomy. Certainly, the kernel Ktn should home in on features that are helpful to distinguish the node tn\u2019s subclasses. 
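To make eq. (1) concrete, here is a minimal numpy sketch of evaluating one semantic base kernel; the function name is ours, and it assumes the metric Mtn has already been learned:

```python
import numpy as np

def semantic_base_kernel(X, M, gamma):
    """Gaussian kernel with a Mahalanobis metric, as in eq. (1) (sketch).

    X: (N, D) data matrix; M: (D, D) PSD metric for node tn; gamma: bandwidth.
    Returns the (N, N) kernel matrix K_tn.
    """
    XM = X @ M
    # x_i^T M x_i for each row i
    sq = np.einsum('ij,ij->i', XM, X)
    # d^2_M(x_i, x_j) = x_i^T M x_i + x_j^T M x_j - 2 x_i^T M x_j
    d2 = sq[:, None] + sq[None, :] - 2.0 * XM @ X.T
    # clip tiny negatives caused by floating-point roundoff
    return np.exp(-gamma * np.maximum(d2, 0.0))
```

With M set to the identity this reduces to the standard RBF kernel, which is a convenient sanity check.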
Beyond that, however, we speci\ufb01cally want it to use features that are as different as possible from the features used by its ancestors. Doing so ensures that the subsequent combination step can choose a sparse set of \u201cdisconnected\u201d features.\n\nTo that end, we apply our Tree of Metrics (ToM) technique [10] to learn the Mahalanobis parameters Mtn. In ToM, metrics are learned by balancing two forces: i) discriminative power and ii) a preference for different features to be chosen between parent and child nodes. The latter exploits the taxonomy semantics, based on the intuition that features used to distinguish more abstract classes (dog vs. cat) should differ from those used for \ufb01ner-grained ones (Siamese vs. Persian cat).\n\nBrie\ufb02y, for each node tn, the training data is reduced to Dn = {(xi, yin)}, where yin is the label of n\u2019s child on the path to the leaf node yi. If yi is not a descendant of the superclass at the node n, then xi is excluded from Dn. The metrics are learned jointly, with each node mutually encouraging the others to use non-overlapping features. ToM achieves this by augmenting a large margin nearest neighbor [33] loss function \u2211n \u2113(Dn; Mtn) with the following disjoint sparsity regularizer:\n\n\u2126d(M) = \u03bb \u2211n\u2265C+1 Trace[Mtn] + \u00b5 \u2211n\u2265C+1 \u2211m\u223cn ||diag(Mtn) + diag(Mtm)||_2^2,  (2)\n\nwhere m \u223c n denotes that node m is either an ancestor or descendant of n. The \ufb01rst part of the regularizer encourages sparsity in the diagonal elements of Mtn, and the second part incurs a penalty when two different metrics \u201ccompete\u201d for the same diagonal element, i.e., to use the same feature dimension. The resulting optimization problem is convex and can be solved ef\ufb01ciently [10].\n\nAfter learning the metrics {Mtn} in each taxonomy, we construct base kernels as in eq. (1). The bandwidths \u03b3tn are set as the average distances on training data. 
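As an illustrative sketch (the naming and data layout are ours, not the authors' code), the value of the disjoint sparsity regularizer in eq. (2) can be computed as:

```python
import numpy as np

def disjoint_sparsity(metrics, related, lam, mu):
    """Disjoint sparsity regularizer of eq. (2) (illustrative sketch).

    metrics: dict {n: M_tn} of Mahalanobis matrices at internal nodes.
    related: set of pairs (n, m) where m is an ancestor or descendant of n.
    lam, mu: the lambda and mu trade-off weights in eq. (2).
    """
    # sparsity-inducing trace term over all internal nodes
    trace_term = lam * sum(np.trace(M) for M in metrics.values())
    # competition penalty: related metrics sharing a diagonal entry pay a cost
    compete_term = mu * sum(
        np.sum((np.diag(metrics[n]) + np.diag(metrics[m])) ** 2)
        for (n, m) in related)
    return trace_term + compete_term
```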
We call the collection F = {Ktn} of all base kernels the semantic kernel forest. Figure 1 shows an illustrative example.\n\nWhile ToM has shown promising results in learning metrics in a single taxonomy, its reliance on linear Mahalanobis metrics is inherently limited. A straightforward convex combination of ToMs would result in yet another linear mapping, incapable of capturing nonlinear inter-taxonomic interactions. In contrast, our kernel approach retains ToM\u2019s granularity-speci\ufb01c features but also enables nontrivial (nonlinear) combinations, especially when coupled with a novel hierarchical regularizer, which we will de\ufb01ne next.\n\n3.2 Learning class-speci\ufb01c kernels across taxonomies\n\nBase kernels in the semantic kernel forest are learned jointly within each taxonomy but independently across taxonomies. To leverage multiple taxonomies and to capture different semantic views of the object categories, we next combine them discriminatively to improve classi\ufb01cation.\n\nBasic setting To learn class-speci\ufb01c features (or kernels), we compose a one-versus-rest supervised learning problem. Additionally, instead of combining all the base kernels in the forest F, we pre-select a subset of them based on the taxonomy structure. Speci\ufb01cally, from each taxonomy, we select base kernels that correspond to the nodes on the path from the root to the leaf node class. For example, in the Biological taxonomy of Figure 1, for the category Dalmatian, this path includes the nodes (superclasses) canine and animal. Thus, for class c, the linearly combined kernel is given by\n\nFc(xi, xj) = \u2211t \u2211n\u223cc \u03b2ctn Ktn(xi, xj),  (3)\n\nwhere n \u223c c indexes the nodes that are ancestors of c, which is a leaf node (recall that the \ufb01rst C nodes in every taxonomy are reserved for leaf class nodes). 
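A minimal sketch of the linear combination in eq. (3), assuming the base kernel matrices are precomputed (helper and argument names are hypothetical):

```python
import numpy as np

def combined_kernel(base_kernels, beta_c, path_c):
    """Class-specific combined kernel F_c of eq. (3) (sketch).

    base_kernels: dict {(t, n): K_tn} of precomputed (N, N) base kernel
                  matrices from the semantic kernel forest.
    beta_c: dict {(t, n): nonnegative weight beta_ctn} learned for class c.
    path_c: list of (t, n) indices on the root-to-c path in each taxonomy.
    """
    F = np.zeros_like(base_kernels[path_c[0]])
    for tn in path_c:  # sum over taxonomies t and ancestor nodes n ~ c
        F += beta_c[tn] * base_kernels[tn]
    return F
```

Because all weights are nonnegative and each K_tn is PSD, the returned F_c is itself a valid kernel.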
The combination coef\ufb01cients \u03b2ctn are constrained to be nonnegative to ensure the positive semide\ufb01niteness of the resulting kernel Fc(\u00b7, \u00b7).\n\nWe apply the kernel Fc(\u00b7, \u00b7) to construct the one-versus-rest binary classi\ufb01er to distinguish instances from class c from all other classes. We then optimize \u03b2c = {\u03b2ctn} such that the classi\ufb01er attains the lowest empirical misclassi\ufb01cation risk. The resulting optimization (in its dual formulation) is analogous to standard multiple kernel learning [30]:\n\nmin\u03b2c max\u03b1c \u2211i \u03b1ci \u2212 (1/2) \u2211i \u2211j \u03b1ci \u03b1cj qci qcj Fc(xi, xj)  s.t. \u2211i \u03b1ci qci = 0, 0 \u2264 \u03b1ci \u2264 C, \u2200 i,  (4)\n\nwhere \u03b1c collects the Lagrange multipliers for the binary SVM classi\ufb01er, C is the regularizer for the SVM\u2019s hinge loss function, and qci = \u00b11 is the indicator variable of whether or not xi\u2019s label is c.\n\nHierarchical regularization Next, we extend the basic setting to incorporate richer modeling assumptions. We hypothesize that kernels at higher-level nodes should be preferred to lower-level nodes. Intuitively, higher-level kernels relate to more classes, thus are likely essential to reduce loss.\n\nWe leverage this intuition and knowledge about the relative priority of the kernels from each taxonomy\u2019s hierarchical structure. We design a novel structural regularization that prefers larger weights for a parent node compared to its children. Formally, the proposed MKL-H regularizer is given by:\n\n\u2126(\u03b2c) = \u03bb \u2211t,n\u223cc \u03b2ctn + \u00b5 \u2211t,n\u223cc max(0, \u03b2ctn \u2212 \u03b2ctpn + 1).  (5)\n\nThe \ufb01rst part prefers a sparse set of kernels. The second part (in the form of hinge loss) encodes our desire to have the weight assigned to a node n be less than the weight assigned to the node\u2019s parent pn. 
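The MKL-H regularizer of eq. (5) is straightforward to evaluate; a sketch with hypothetical names (root nodes, which have no parent on the path, contribute only to the sparsity term):

```python
def mkl_h_regularizer(beta_c, parent, lam, mu):
    """Hierarchical MKL regularizer of eq. (5) (illustrative sketch).

    beta_c: dict {(t, n): weight beta_ctn} over nodes n ~ c in each taxonomy t.
    parent: dict mapping a node (t, n) to its parent (t, p_n); roots absent.
    lam, mu: the lambda and mu trade-off weights in eq. (5).
    """
    # L1-style sparsity term over all selected kernel weights
    sparsity = lam * sum(beta_c.values())
    # hinge penalty whenever a child's weight is not smaller than its
    # parent's by the margin of 1
    hierarchy = mu * sum(
        max(0.0, beta_c[tn] - beta_c[parent[tn]] + 1.0)
        for tn in beta_c if tn in parent and parent[tn] in beta_c)
    return sparsity + hierarchy
```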
We also introduce a margin of 1 to further increase the difference between the two weights.\n\nHierarchical regularization was previously explored in [34], where a mixed (1, 2)-norm is used to regularize the relative sizes between the parent and the children. The main idea there is to discard children nodes if the parent is not selected. Our regularizer is similar, but is simpler and more computationally ef\ufb01cient. (Additionally, our preliminary studies show [34] has no empirical advantage over our approach in improving recognition accuracy.)\n\n3.3 Numerical optimization\n\nOur learning problem is cast as a convex optimization that balances the discriminative loss in eq. (4) and the regularizer in eq. (5):\n\nmin\u03b2c f(\u03b2c) = g(\u03b2c) + \u2126(\u03b2c),  s.t. \u03b2c \u2265 0,  (6)\n\nwhere we use the function g(\u03b2c) to encapsulate the inner maximization problem over \u03b1c in eq. (4).\n\nWe use the projected subgradient method to solve eq. (6), for its ease of implementation and practical effectiveness [35]. Speci\ufb01cally, at iteration t, let \u03b2c^t be the current value of \u03b2c. We compute f(\u03b2c)\u2019s subgradient st, then perform the following update,\n\n\u03b2c^{t+1} \u2190 max(0, \u03b2c^t \u2212 \u03b1t st),  (7)\n\nwhere the max( ) function implements the projection operation such that the update does not fall outside of the feasible region \u03b2c \u2265 0. For step size \u03b1t, we use the modi\ufb01ed Polyak\u2019s step size [36].\n\n4 Experiments\n\nWe validate our approach on multiple image datasets, and compare to several informative baselines.\n\n4.1 Image datasets and taxonomies\n\nWe consider two publicly available image collections: Animals with Attributes (AWA) [14] and ImageNet [7]1. We form two datasets from AWA. 
The \ufb01rst consists of the four classes shown in Fig. 1, and totals 2,228 images; the second contains the ten classes in [14], and totals 6,180 images. We refer to them as AWA-4 and AWA-10, respectively. The third dataset, ImageNet-20, consists of 28,957 total images spanning 20 classes from ILSVRC2010. We chose classes that are non-animals (to avoid overlap with AWA) and that have attribute labels [37].\n\n1attributes.kyb.tuebingen.mpg.de/ and image-net.org/challenges/LSVRC/2011/\n\nFigure 2: Taxonomies for the AWA-10 ((a) WordNet, (b) Appearance, (c) Behavior, (d) Habitat) and ImageNet-20 ((e) WordNet, (f) Appearance, (g) Attributes) datasets.\n\nTo obtain multiple taxonomies per dataset, we use attribute labels and WordNet. Attributes are human understandable properties shared among object classes, e.g., \u201cfurry\u201d, \u201c\ufb02at\u201d, \u201ccarnivorous\u201d [14]. AWA and ImageNet have 85 and 25 attribute labels, respectively. 
To form semantic taxonomies based on attributes, we \ufb01rst manually divide the attribute labels into subsets according to their mutual semantic relevance (e.g., \u201cfurry\u201d and \u201cshiny\u201d are attributes relevant for an Appearance taxonomy, while \u201cland-dwelling\u201d and \u201caquatic\u201d are relevant for a Habitat taxonomy). Then, for each subset of attributes, we perform agglomerative clustering using Euclidean distance on vectors of the training images\u2019 real-valued attributes. We restrict the tree height (6 for ImageNet and 3 for AWA) to ensure that the branching factor at the root is not too high. To extract a WordNet taxonomy, we \ufb01nd all nodes in WordNet that contain the object class names on their word lists, and then build a hierarchy by pruning nodes with only one child and resolving cases where a node has multiple parents.\n\nFor AWA-10, we use 4 taxonomies: one from WordNet, and three based on attribute subsets re\ufb02ecting Appearance, Behavior, and Habitat ties. For ImageNet-20, we use 3 taxonomies: one from WordNet, one re\ufb02ecting Appearance as found by hierarchical clustering on the visual features, and one re\ufb02ecting Attributes using annotations from [37]. For the AWA-4 taxonomies, we simply generate all 3 possible 2-level binary trees, which, based on manual observation, yield taxonomies re\ufb02ecting Biological, Appearance, and Habitat ties between the animals. See Figures 1 and 2.\n\nWe stress that these taxonomies are created externally with human knowledge, and thus they inject perceived object relationships into the feature learning problem. 
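The attribute-based taxonomy construction described above can be sketched with standard agglomerative clustering; this is our own approximation of the procedure (function name, linkage choice, and level-selection heuristic are assumptions), not the authors' code:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

def attribute_taxonomy(class_attributes, height):
    """Build one attribute-based taxonomy by agglomerative clustering (sketch).

    class_attributes: (C, A) matrix of per-class real-valued attributes,
        restricted to one semantically coherent subset (e.g., appearance).
    height: number of levels to keep, mirroring the paper's height restriction.
    Returns a list of per-level cluster assignments, coarse to fine.
    """
    # average-linkage agglomerative clustering with Euclidean distance
    Z = linkage(class_attributes, method='average', metric='euclidean')
    # pick `height` cut points between 2 clusters and all-singleton leaves
    ks = np.linspace(2, len(class_attributes), height, dtype=int)
    return [cut_tree(Z, n_clusters=int(k)).ravel() for k in ks]
```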
This is in stark contrast to prior work that focuses on optimizing hierarchies for ef\ufb01ciency, without requiring interpretability of the trees themselves [16, 12, 13, 17].\n\n4.2 Baseline methods for comparison\n\nWe compare our method to three key baselines: 1) Raw feature kernel: an RBF kernel computed on the original image features, with the \u03b3 parameter set to the inverse of the mean Euclidean distance d among training instances. 2) Raw feature kernel + MKL: MKL combination of multiple such RBF kernels constructed by varying \u03b3, which is a traditional approach to generate base kernels (e.g., [30]). For this baseline, we generate the same number N of base kernels as in the semantic kernel forest, with \u03b3 = \u03c3/d, for \u03c3 = {2^{1\u2212m}, . . . , 2^{N\u2212m}}, where m = N/2. 3) Perturbed semantic kernel tree: a semantic kernel tree trained with taxonomies that have randomly swapped leaves.\n\nMethod | AWA-4 | AWA-10 | ImageNet-20\nRaw feature kernel | 47.67 \u00b1 2.22 | 30.80 \u00b1 1.36 | 28.20 \u00b1 1.45\nRaw feature kernel + MKL | 48.50 \u00b1 1.89 | 31.13 \u00b1 2.81 | 27.67 \u00b1 1.50\nPerturbed semantic kernel tree + MKL-H | N/A | 31.53 \u00b1 2.07 | 28.20 \u00b1 2.02\nPerturbed semantic kernel forest + MKL-H | N/A | 33.20 \u00b1 2.96 | 30.77 \u00b1 1.53\nSemantic kernel tree + Avg | 47.17 \u00b1 2.40 | 31.92 \u00b1 1.21 | 28.97 \u00b1 1.61\nSemantic kernel tree + MKL | 48.89 \u00b1 1.06 | 32.43 \u00b1 1.93 | 29.74 \u00b1 1.26\nSemantic kernel tree + MKL-H | 50.06 \u00b1 1.12 | 32.68 \u00b1 1.79 | 29.90 \u00b1 0.70\nSemantic kernel forest + MKL | 49.67 \u00b1 1.11 | 34.60 \u00b1 1.78 | 30.97 \u00b1 1.14\nSemantic kernel forest + MKL-H | 52.83 \u00b1 1.68 | 35.87 \u00b1 1.22 | 32.30 \u00b1 1.00\n\nTable 1: Multi-class classi\ufb01cation accuracy on all datasets, across 5 train/test splits. 
(The perturbed semantic kernel tree baseline is not applicable for AWA-4, since all possible groupings are present in the taxonomies.)

[Figure 3 appears here: per-class bar charts for AWA-10 and ImageNet-20. Legend means: AWA-10 — WordNet 1.73, Appearance 1.00, Behavior 2.53, Habitat 2.27, All 5.07; ImageNet-20 — WordNet 0.73, Visual 1.97, Attributes 2.40, All 4.10.]
Figure 3: Per-class accuracy improvements of each individual taxonomy and the semantic kernel forest ("All") over the raw feature kernel baseline. Numbers in legends denote mean improvement. Best viewed in color.

The first two baselines will show the accuracy attainable using the same image features and basic classification tools (SVM, MKL) as our approach, but lacking the taxonomy insights.
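The two raw-feature baselines above can be sketched as follows; this is a minimal illustration on an assumed small feature matrix X, building the single RBF kernel with γ = 1/d and the bank of N base kernels with γ = σ/d for σ ∈ {2^{1-m}, . . . , 2^{N-m}}, m = N/2.

```python
import numpy as np

def rbf_baseline_kernels(X, N=6):
    """Baseline 1: one RBF kernel with gamma = 1/d, where d is the mean
    pairwise Euclidean distance among training instances.
    Baseline 2: a bank of N base kernels with gamma = sigma/d,
    sigma in {2^{1-m}, ..., 2^{N-m}}, m = N/2 (N assumed even)."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))              # pairwise Euclidean distances
    d = D[np.triu_indices_from(D, k=1)].mean()    # mean distance d
    single = np.exp(-(1.0 / d) * D ** 2)          # baseline 1: gamma = 1/d
    m = N // 2
    sigmas = [2.0 ** (k - m) for k in range(1, N + 1)]
    bank = [np.exp(-(s / d) * D ** 2) for s in sigmas]  # baseline 2 kernels
    return single, bank
```

Note that the kernel for σ = 1 (the middle of the grid) coincides with the single raw-feature kernel, so the MKL baseline's bank brackets it with narrower and wider bandwidths.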
The last baseline will test if weakening the semantics in the taxonomy has a negative impact on accuracy.

We evaluate several variants of our approach, in order to analyze the impact of each component: 1) Semantic kernel tree + Avg: an equal-weight average of the semantic kernels from one taxonomy. 2) Semantic kernel tree + MKL: the same kernels, but combined with MKL using sparsity regularization only (i.e., µ = 0 in eq. 5). 3) Semantic kernel tree + MKL-H: the same as previous, but adding the proposed hierarchical regularization (eq. 5). 4) Semantic kernel forest + MKL: semantic forest kernels from multiple taxonomies combined with MKL. 5) Semantic kernel forest + MKL-H: the same as previous, but adding our hierarchical regularizer.

4.3 Implementation details

For all results, we use 30/30/30 images per class for training/validation/testing, and generate 5 such random splits. We report average multi-class recognition accuracy and standard errors for the 95% confidence interval. For single-taxonomy results, we report the average over all individual taxonomies. For all methods, the raw image features are bag-of-words histograms obtained on SIFT, provided with the datasets. We reduce their dimensionality to 100 with PCA to speed up the ToM training, following [10]. To train ToM, we sample 400 random constraints and cross-validate the regularization parameters λ, γ ∈ {0.1, 1, 10}. For MKL/MKL-H, we use C = 1000 for the C-SVM parameter, and cross-validate the sparsity and hierarchical parameters λ, µ ∈ {0, 0.1, 1, 10}.

4.4 Results

Quantitative results  Table 1 shows the multi-class classification accuracy on all three datasets. Our semantic kernel forests approach significantly outperforms all three baselines.
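The combination step shared by these variants can be sketched as follows: the Avg variant is an equal-weight sum of the tree's kernels, while the MKL variants replace the uniform weights with per-class nonnegative weights β_c over the same base kernels. The β values below are placeholders for illustration; in our method they are learned by MKL under the regularizers of eq. 5.

```python
import numpy as np

def combine_kernels(kernels, beta=None):
    """Combine base kernels from a semantic tree or forest.
    beta=None gives the equal-weight average (the Avg variant);
    otherwise beta is a per-class weight vector, K_c = sum_t beta_t K_t."""
    K = np.stack(kernels)                      # shape (T, n, n)
    if beta is None:
        beta = np.full(len(kernels), 1.0 / len(kernels))
    return np.tensordot(beta, K, axes=1)       # weighted sum over kernels

# Two toy 3x3 base kernels standing in for semantic-tree kernels.
base = [np.eye(3), np.ones((3, 3))]
avg = combine_kernels(base)                    # Avg variant
sparse = combine_kernels(base, np.array([1.0, 0.0]))  # an MKL-like selection
```

A sparse β zeroes out entire kernels, which is what the L1 term encourages; the hierarchical term additionally biases the weights toward kernels at upper taxonomy nodes.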
It improves accuracy for 9 of the 10 AWA-10 classes, and 16 of the 20 classes in ImageNet-20 (see Figure 3). These gains clearly show the impact of injecting semantics into discriminative feature learning. The forests' advantage over the individual trees supports our core claim regarding the value of interleaving semantic cues from multiple taxonomies. Further, the proposed hierarchical regularization (MKL-H) outperforms the generic MKL, particularly for the multiple taxonomy forests.

[Figure 4 appears here. Panel accuracies: (a) Biological 38.33, (b) Appearance 50.83, (c) Habitat 43.33, (d) All 55.00; (e) l1-only 34.33, (f) l1 + Hierarchical 35.67.]
Figure 4: (a-d): AWA-4 confusion matrices for individual taxonomies (a-c) and the combined taxonomies (d). Y-axis shows true classes; x-axis shows predicted classes. (e-f): Example β_c's to show the characteristics of the two regularizers. Each entry is a learned kernel weight (brighter = higher weight). Y-axis shows object classes; x-axis shows kernel node names.

We stress that semantic kernel forests' success is not simply due to having access to a variety of kernels, as we can see by comparing our method to both the raw feature MKL and perturbed tree results, all of which use the same number of kernels. Instead, the advantage is leveraging the implicit discriminative criteria embedded in the external semantic groupings. In addition, we note that even perturbed taxonomies can be semantic; some of their groupings of classes may happen to be meaningful, especially when there are fewer categories. Hence, their advantage over the raw feature kernels is understandable. Nonetheless, perturbed taxonomies are semantically weaker than the originals, and our kernel trees with the true single or multiple taxonomies perform better.

MKL-H has the most impact for the multiple taxonomy forests, and relatively little on the single kernel tree. This makes sense. For a single taxonomy, a single kernel is solely responsible for discriminating a class from the others, making all kernels similarly useful. In contrast, in the forest, two classes are related at multiple different nodes, making it necessary to select out the useful views; here, the hierarchical regularizer plays the role of favoring kernels at higher levels, which might have more generalization power due to the training set size and number of classes involved.

The per-class and per-taxonomy comparisons in Figure 3 further elucidate the advantage of using multiple complementary taxonomies. A single semantic kernel tree often improves accuracy on some classes, but at the expense of reduced accuracy on others. This illustrates that the structure of an individual taxonomy is often suboptimal. For example, the Habitat taxonomy on AWA-10 helps distinguish humpback whale well from the others (it branches early from the other animals due to its distinctive "oceanic" background), but it hurts accuracy for giant panda. The WordNet taxonomy does exactly the opposite, improving giant panda via its biological ties, but hurting humpback whale.
The semantic kernel forest takes the best of both through its learned combination. The only cases in which it fails are when the majority of the taxonomies strongly degrade performance, as is to be expected given the linear MKL combination (e.g., see the classes marimba and rule).

Further qualitative analysis  Figure 4 (a-d) shows the confusion matrices for AWA-4 using only the root-level kernels. We see how each taxonomy specializes the features, exactly in the manner sketched in Sec. 1. The combination of all taxonomies achieves the highest accuracy (55.00), better than the best-performing individual taxonomy (Appearance, 50.83). Figure 4 (e-f) shows the learned kernel combination weights β_c for each class c in AWA-10, using the two different regularizers. In (e), the L1 regularizer selects a sparse set of useful kernels. For example, the humpback whale drops the kernels belonging to the whole Behavior taxonomy block, and gives the strongest weight to "hairless" and "habitat". However, by failing to select some of the upper-level nodes, it focuses only on the most confusing fine-grained problems. In contrast, with the proposed regularization (f), we see more emphasis on the upper nodes (e.g., the "behavior" and "placental" kernels), which helps accuracy.

5 Conclusion

We proposed a semantic kernel forest approach to learn discriminative visual features that leverage information from multiple semantic taxonomies. The results show that it improves object recognition accuracy, and give good evidence that committing to a single external knowledge source is insufficient. In future work, we plan to explore non-additive and/or local per-instance kernel combination techniques for integrating the semantic views.

Acknowledgements  This research is supported in part by NSF IIS-1065243 and NSF IIS-1065390.

References

[1] N. Dalal and B. Triggs.
Histograms of Oriented Gradients for Human Detection. In CVPR, 2005.
[2] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-Constrained Linear Coding for Image Classification. In CVPR, 2010.
[3] C. Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, May 1998.
[4] A. Zweig and D. Weinshall. Exploiting Object Hierarchy: Combining Models from Different Category Levels. In ICCV, 2007.
[5] M. Marszalek and C. Schmid. Semantic hierarchies for visual object recognition. In CVPR, 2007.
[6] A. Torralba, R. Fergus, and W. T. Freeman. 80 Million Tiny Images: a Large Dataset for Non-Parametric Object and Scene Recognition. PAMI, 30(11):1958–1970, 2008.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
[8] R. Fergus, H. Bernal, Y. Weiss, and A. Torralba. Semantic label sharing for learning with many categories. In ECCV, 2010.
[9] J. Deng, A. Berg, K. Li, and L. Fei-Fei. What does classifying more than 10,000 image categories tell us? In ECCV, 2010.
[10] S. J. Hwang, K. Grauman, and F. Sha. Learning a tree of metrics with disjoint visual features. In NIPS, 2011.
[11] N. Verma, D. Mahajan, S. Sellamanickam, and V. Nair. Learning hierarchical similarity metrics. In CVPR, 2012.
[12] S. Bengio, J. Weston, and D. Grangier. Label Embedding Trees for Large Multi-Class Tasks. In NIPS, 2010.
[13] J. Deng, S. Satheesh, A. Berg, and L. Fei-Fei. Fast and balanced: Efficient label tree learning for large scale object recognition. In NIPS, 2011.
[14] C. Lampert, H. Nickisch, and S. Harmeling. Learning to Detect Unseen Object Classes by Between-Class Attribute Transfer. In CVPR, 2009.
[15] M. Marszalek and C. Schmid. Constructing category hierarchies for visual recognition. In ECCV, 2008.
[16] G. Griffin and P. Perona. Learning and using taxonomies for fast visual categorization.
In CVPR, 2008.
[17] T. Gao and D. Koller. Discriminative learning of relaxed hierarchy for large-scale visual recognition. In ICCV, 2011.
[18] J. Sivic, B. Russell, A. Zisserman, W. Freeman, and A. Efros. Unsupervised discovery of visual object class hierarchies. In CVPR, 2008.
[19] E. Bart, I. Porteous, P. Perona, and M. Welling. Unsupervised learning of visual taxonomies. In CVPR, 2008.
[20] L.-J. Li, C. Wang, Y. Lim, D. Blei, and L. Fei-Fei. Building and using a semantivisual image hierarchy. In CVPR, 2010.
[21] S. Kim and E. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In ICML, 2010.
[22] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical Correlation Analysis: An Overview with Application to Learning Methods. Neural Computation, 16(12), 2004.
[23] A. Blum and T. Mitchell. Combining Labeled and Unlabeled Data with Co-training. In COLT: Proceedings of the Workshop on Computational Learning Theory, 1998.
[24] C. Christoudias, K. Saenko, L. Morency, and T. Darrell. Co-adaptation of audio-visual speech and gesture classifiers. In International Conference on Multimodal Interaction, 2006.
[25] I. Dhillon, S. Mallela, and R. Kumar. A divisive information-theoretic feature clustering algorithm for text classification. Journal of Machine Learning Research, 3:1265–1287, 2003.
[26] A. Gupta and S. Dasgupta. Hybrid hierarchical clustering: Forming a tree from multiple views. In Workshop on Learning With Multiple Views, 2005.
[27] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In NIPS, 2006.
[28] N. Loeff and A. Farhadi. Scene Discovery by Matrix Factorization. In ECCV, 2008.
[29] S. J. Hwang, F. Sha, and K. Grauman. Sharing features between objects and their attributes. In CVPR, 2011.
[30] F. Bach, G. Lanckriet, and M. Jordan. Multiple Kernel Learning, Conic Duality, and the SMO Algorithm. In ICML, 2004.
[31] M.
Varma and D. Ray. Learning the discriminative power-invariance trade-off. In ICCV, 2007.
[32] P. Gehler and S. Nowozin. On feature combination for multiclass object classification. In ICCV, 2009.
[33] K. Weinberger, J. Blitzer, and L. Saul. Distance Metric Learning for Large Margin Nearest Neighbor Classification. In NIPS, 2006.
[34] F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In NIPS, 2008.
[35] D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[36] S. Boyd and A. Mutapcic. Subgradient methods. 2007.
[37] O. Russakovsky and L. Fei-Fei. Attribute learning in large-scale datasets. In ECCV, 2010.